Abstract: Massive numbers of Semi-Structured Documents (SSDs) have accumulated during the evolution of the Internet. These SSDs contain both unstructured features (e.g., plain text) and metadata (e.g., tags). Most previous works focused on modeling the unstructured text, and recently some methods have been proposed to model the unstructured text together with specific tags. Building a general model for SSDs remains an important problem in terms of both model fitness and efficiency. We propose a novel method to model SSDs, the Tag-Weighted Topic Model (TWTM). TWTM is a framework that leverages both tag and word information, not only to learn the document-topic and topic-word distributions, but also to infer the tag-topic distributions for text mining tasks. We present an efficient variational inference method with an EM algorithm for estimating the model parameters. In addition, we propose three large-scale solutions for our model on the MapReduce distributed computing platform for modeling large-scale SSDs. The experimental results show the effectiveness, efficiency and robustness of our model in comparison with state-of-the-art methods in document modeling, tag prediction and text classification. We also show the performance of the three distributed solutions in terms of time and accuracy on document modeling.

Index Terms: semi-structured documents, topic model, tag-weighted, variational inference, large-scale, parallelized solutions
1 Introduction

In the evolution of the Internet, a huge number of documents have accumulated in many web applications. Documents that carry both plain text and document metadata (tags, which can be viewed as features of the corresponding document) are called Semi-Structured Documents (SSDs). How to characterize semi-structured document data has become an important issue in many areas, such as information retrieval, artificial intelligence and data mining. The tags can be more important than the text data in document mining. For example, in IMDB, the world's most popular and authoritative source for movie, TV and celebrity content, each movie has many tags, such as director, writers, stars, country and language, and a storyline as text data. Given a movie with the tag "Dick Martin", we may infer that it is more likely to be a comedy, without reading the full text of its storyline or watching it. Another example is a collection of scientific articles, where each document has a list of tags (authors and keywords). Before reading the main text of a paper, we already have an idea of what it is about from the authors or the keywords it provides.

Many solutions have been proposed to deal with semi-structured documents (e.g., SVD, LSI), and have been shown to be useful in document mining [11], [25], [35], [33], e.g., text classification and the exploitation of structural information. For document modeling, topic models have proved to be a powerful method for analyzing and modeling document corpora, using Bayesian statistics and machine learning to discover the thematic content of untagged documents. Topic models such as latent Dirichlet allocation (LDA) [5] can discover the latent structures in documents and establish links between them. However, as an unsupervised method, LDA models only the words in the documents. Thus, LDA can only treat the tags as word features rather than as a new kind of information for document modeling.

Modeling semi-structured documents requires considering the characteristics of different kinds of objects, including words, topics, documents and tags, and the relationships among them. In this problem, topics are hidden objects, and the other three are observed. Relative to tags, words and documents are objective; a tag can be either objective (e.g., author and venue information of publications) or subjective (e.g., tags assigned by people in social bookmarking). As in topic models, we should consider binary relationships between pairs of objects, including topic-word and document-topic. In addition, we may consider further binary relationships, such as tag-word, tag-topic, tag-document and tag-tag. The tag-document relationship implies that we should consider the weights of the tags in each document. The tag-topic and tag-tag relationships can be more complicated and are thus difficult to model. Some earlier works consider only certain tags. For example, the author-topic model in [31] considers the authorship information of the documents to be modeled. In this work, we do not limit the types or the number of tags in each document. In the extreme case where no document has any tag, the new model may degenerate into LDA. On the other hand, since the tags can be created by people, they should be relevant to the topics of the documents; however, some of them may be correlated, redundant or even noisy. Therefore, the tag-topic relationships should be general enough, and we should also model the weights of the tags in each document.

In the past few years, researchers have proposed approaches to model documents with tags or labels [26], [29], [30]. For example, Labeled LDA [29] assumes that there are no latent topics and that each topic is restricted to be associated with the given labels. PLDA assumes that each topic is associated with only one label [30]. However, both Labeled LDA and PLDA implicitly assume that the given labels are strongly associated with the topics to be modeled, or that the labels are independent of each other.

Another problem arises when we need to deal with large-scale semi-structured documents. A variety of algorithms have been used to estimate the parameters of the proposed topic models for mining documents, such as Markov chain Monte Carlo (MCMC) sampling techniques [?], variational methods [3] and other methods [2], [32]. For sampling methods, we may have to resort to an MCMC solution tailored to a particular model [7], which makes it harder to meet requirements on convergence and speed, especially when the corpus comprises millions of words. Variational methods, as approximate solutions, improve the learning speed to some extent. However, they too become ineffective in learning speed and model accuracy on a large-scale corpus.

In this paper, we propose the Tag-Weighted Topic Model (TWTM), a framework that represents the text data together with the various tags, using weights to evaluate the importance of the tags. Besides learning the topic distributions of documents and the topic distributions over words, the framework also infers the topic distributions of tags. The weights of the observed tags in each document, which we infer from the dataset, provide a way to rank the tags.

In many web applications, not all the documents in a corpus have tags. Many documents consist only of words without any tags, which may be removed after data preprocessing for denoising. Considering only the weights among tags cannot handle this case. To address this problem, we also propose a more flexible extended model called Tag-Weighted Dirichlet Allocation (TWDA). It is based on TWTM, and learns the weights among a Dirichlet prior and the given tags, not just among the tags. Therefore, TWDA handles not only semi-structured documents but also unstructured documents. For unstructured documents, TWDA degenerates into latent Dirichlet allocation (LDA). For hybrid corpora that consist of both semi-structured and unstructured documents, TWDA can handle this complex type of corpus more effectively and easily.

To address the challenge of modeling large-scale corpora, we propose three distributed schemes for the TWTM framework in the MapReduce programming framework [?]. The proposed model makes four principal contributions.

1) It is a novel topic modeling method for semi-structured documents, which not only generates the topic distributions over words, but also infers the topic distributions of tags.

2) TWTM leverages the weights among the observed tags in a document to evaluate the importance of the tags, using a tag-weighted topic assignment function. The weights associated with the observed tags in a document provide a way to rank the tags. In addition, they can be used to predict latent tags in the document.

3) The tag-weighted framework is easy to extend to many different real-world applications. For example, with the extended model TWDA, we can handle multi-tag documents and non-tag documents simultaneously, which is very useful for complicated web applications.

4) Three distributed solutions for TWTM are proposed that address the challenges of working with large-scale semi-structured documents in the MapReduce programming framework.

The rest of the paper is organized as follows. In Section 2, we analyze and discuss related works. In Section 3, after introducing the notation, we present the novel topic modeling framework of TWTM and give the methods of learning and inference. In Section 4, we present the extended model TWDA and give its learning and inference procedures. In Section 5, we give a theoretical analysis of the differences between TWTM and TWDA, and compare them with other topic models. In Section 6, we propose three distributed solutions of TWTM for large-scale semi-structured documents. In Section 7, we present experimental results on three domains to show the performance of the proposed method in document modeling and text classification, and the effectiveness and efficiency of the three large-scale solutions on large-scale semi-structured document modeling. We conclude the paper in Section 8.

2 Related Works 

Topic models provide an amalgam of ideas drawn from mathematics, computer science, and cognitive science to help users understand unstructured data. Many topic models have been proposed and shown to be powerful for document analysis, such as those in [28], [20], [9], [8], [10], [?], and have been applied to many areas, including document clustering and classification [12] and information retrieval [34]. They have been extended to many other topic models for different application settings in analyzing text data [21], [23], [36]. However, most of these models consider only the textual information and can only treat the tag information as plain text.

TMBP [17] and cFTM [15] propose methods that make use of the contextual information of documents for topic modeling. TMBP is a topic model with biased propagation that leverages contextual information, namely the authors and venue. TMBP needs to predefine the weights of the author and venue information on word assignment, which limits its usefulness in real applications. cFTM makes the very strong assumption that each word is associated with only one tag, either author or venue. In many applications, this assumption may not hold.

Several models have been proposed to take advantage of tags or labels, such as Labeled LDA [29], DMR [26] and PLDA [30], or to model relationships among several variables, such as the Author-Topic Model [31]. Labeled LDA [29] obtains the topic distribution of a document by picking out the hyperparameter components that correspond to its labels, and draws the topic components from the new hyperparameter without inferring the topic distributions of the labels. Labeled LDA does not assume the existence of any latent topics [30]. PLDA [30] provides another way of modeling tagged text data: it assumes that the topic assignment of each word is limited by only one of the given tags and, in the training process, that each topic takes part in exactly one label, optionally sharing a global label present on every document. The Author-Topic Model obtains the topic distributions of authors without giving importance weights among the given authors of each document. DMR [26] is a Dirichlet-multinomial regression topic model that includes a log-linear prior on the document-topic distributions, which is an exponential function of the given features of the document. However, DMR does not output the tag weights either [31], which are useful for tag ranking.

In this work, we therefore propose a tag-weighted topic modeling framework that leverages the tag information given in a document through a list of weight values to model the topic distribution of the document. For a mixed collection of semi-structured and unstructured documents, we present an extended model called Tag-Weighted Dirichlet Allocation, which considers both a Dirichlet prior and the tags through the weight values among them. Based on the framework of the Tag-Weighted Topic Model, we also present three large-scale solutions on the MapReduce distributed computing platform for large-scale semi-structured documents.

3 TWTM Model and Algorithms 

In this section, we will mathematically define the 
tag-weighted topic model (TWTM), and discuss the 
learning and inference methods. 

3.1 Notation 

Similar to LDA [3], we formally define the following terms. Consider a semi-structured corpus, a collection of $M$ documents. We define the corpus $D = \{(w^1, t^1), \ldots, (w^M, t^M)\}$, where each 2-tuple $(w^d, t^d)$ denotes a document, $w^d = (w_1^d, \ldots, w_N^d)$ is its bag-of-words representation, $t^d = (t_1^d, \ldots, t_L^d)$ is the document tag vector, each element of which is a binary tag indicator, and $L$ is the size of the tag set of the corpus $D$. For convenience of inference in this paper, $t^d$ is expanded to an $l^d \times L$ matrix $T^d$, where $l^d$ is the number of tags in document $d$. For each row index $i \in \{1, \ldots, l^d\}$ in $T^d$, the row $T^d_{i\cdot}$ is a binary vector, where $T^d_{ij} = 1$ if and only if the $i$-th tag of document $d$ is the $j$-th tag of the tag set of the corpus $D$. In this paper, we wish to find a probabilistic model for the corpus $D$ that, utilizing the given tag information, assigns high likelihood to the documents in the corpus and to other similar documents.
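The expansion of the tag vector into the matrix $T^d$ can be illustrated with a short sketch. The function name `build_tag_matrix` and the use of 0-based tag indices are assumptions made for illustration only; the tags are sorted by their index in the corpus tag set, as the paper's footnote suggests.

```python
def build_tag_matrix(doc_tags, num_tags):
    """Expand the tag ids of one document into an l_d x L one-hot matrix T^d.

    doc_tags: iterable of 0-based tag ids observed in the document (assumed encoding).
    num_tags: L, the size of the tag set of the corpus.
    """
    rows = []
    for tag in sorted(doc_tags):      # sort tags by their index in the corpus tag set
        row = [0] * num_tags
        row[tag] = 1                  # T^d[i][j] = 1 iff the i-th tag of d is tag j
        rows.append(row)
    return rows


# A document carrying tags 4 and 1 out of L = 5 tags.
T_d = build_tag_matrix([4, 1], num_tags=5)
```

Each row of `T_d` selects exactly one tag, so multiplying `T_d` by any $L$-row matrix simply picks out the rows that belong to the document's tags.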

3.2 Tag-Weighted Topic Model 

TWTM is a probabilistic graphical model that describes a process for generating a semi-structured document collection. In previous topic models, a document $d$ is typically characterized by a multinomial distribution over topics, $\theta^d$, and each topic $k$ is represented by a distribution $\psi_k$ over the words in a vocabulary. Taking LDA [3] as an example, the generative process for the topic distribution of document $d$ is assumed to be as follows.

Choose $\theta^d \sim \mathrm{Dirichlet}(\alpha)$,
and choose $z_{di} \sim \mathrm{Multinomial}(\theta^d)$,

where $\alpha$ is the hyperparameter of $\theta^d$. In LDA, the topic distribution $\theta^d$ is drawn from the hyperparameter $\alpha$ without considering the given tags. However, the tag information should be more useful for generating $\theta^d$ than a Dirichlet prior alone.

In this paper, we use $\vartheta^d$, instead of $\theta^d$, to denote the topic distribution of document $d$, as shown in Figure 1. Let $\theta$ represent an $L \times K$ topic distribution matrix over the tag set, where $K$ is the number of topics. Let $\psi$ represent a $K \times V$ distribution matrix over the words in the dictionary, where $V$ is the number of words in the dictionary of $D$. Similar to LDA, TWTM models document $d$ as a mixture of underlying topics and generates each word from one topic. The topic proportions $\vartheta^d$ of document $d$ are a mixture of tag-topic distributions, rather than being controlled only by a hyperparameter as in LDA.

Fig. 1. Graphical model representation for TWTM, where $\theta$ is the distribution matrix of all tags, $\psi$ is the distribution matrix of words, $\varepsilon^d$ represents the weight vector of the tags, and $\vartheta^d$ indicates the topic components of each document. $\pi$ is a Dirichlet prior and $\eta$ is a Bernoulli prior.

2. Note that we can sort the tags of the document $d$ by the index of the tag set of the corpus $D$.

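The generative process for TWTM is given by the following procedure:

1) For each topic $k \in \{1, \ldots, K\}$, draw $\psi_k \sim \mathrm{Dir}(\beta)$, where $\beta$ is a $V$-dimensional prior vector of $\psi$.

2) For each tag $t \in \{1, \ldots, L\}$, draw $\theta_t \sim \mathrm{Dir}(\alpha)$, where $\alpha$ is a $K$-dimensional prior vector of $\theta$.

3) For each document $d$:

a) For each $l \in \{1, \ldots, L\}$, draw $t_l^d \sim \mathrm{Bernoulli}(\eta_l)$.

b) Generate $T^d$ by $t^d$.

c) Draw $\varepsilon^d \sim \mathrm{Dir}(T^d \times \pi)$.

d) Generate $\vartheta^d = (\varepsilon^d)^T \times (T^d \times \theta)$.

e) For each word $w_{di}$:

i) Draw $z_{di} \sim \mathrm{Mult}(\vartheta^d)$.

ii) Draw $w_{di} \sim \mathrm{Mult}(\psi_{z_{di}})$.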

In this process, $\mathrm{Dir}(\cdot)$ designates a Dirichlet distribution, $\mathrm{Mult}(\cdot)$ is a multinomial distribution, and $\pi$ is an $L \times 1$ column vector, a Dirichlet prior. Note that $\varepsilon^d$ is the weight vector of the observed tags in constituting the topic proportions of document $d$, and $(\varepsilon^d)^T$ is the transpose of $\varepsilon^d$. Furthermore, $\varepsilon^d$ is drawn from a Dirichlet prior obtained by the matrix multiplication $T^d \times \pi$. Clearly, the result of $T^d \times \pi$ is an $l^d \times 1$ vector whose dimension depends on the number of observed tags in document $d$.
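The generative steps above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the helper names `dirichlet` and `sample_document`, the toy sizes and the symmetric priors are all hypothetical, tags are encoded as 0-based ids, and step 3(a) is skipped because the tags of a document are treated as observed.

```python
import random


def dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]


def sample_document(doc_tags, theta, psi, pi, n_words, rng):
    """Run steps 3(c)-(e) for one document whose tag ids are already observed."""
    # Step (c): T^d x pi selects the prior entries of the observed tags.
    eps = dirichlet([pi[t] for t in doc_tags], rng)
    # Step (d): vartheta^d = (eps^d)^T x (T^d x theta), a weighted mix of tag rows.
    num_topics = len(psi)
    vartheta = [sum(e * theta[t][k] for e, t in zip(eps, doc_tags))
                for k in range(num_topics)]
    words = []
    for _ in range(n_words):
        z = rng.choices(range(num_topics), weights=vartheta)[0]   # step (e.i)
        w = rng.choices(range(len(psi[z])), weights=psi[z])[0]    # step (e.ii)
        words.append(w)
    return eps, vartheta, words


rng = random.Random(0)
K, V, L = 3, 6, 4                                       # toy sizes (assumed)
psi = [dirichlet([0.5] * V, rng) for _ in range(K)]     # step 1: topic-word rows
theta = [dirichlet([0.5] * K, rng) for _ in range(L)]   # step 2: tag-topic rows
pi = [1.0] * L                                          # symmetric prior (assumed)
eps, vartheta, words = sample_document([0, 2], theta, psi, pi, 8, rng)
```

Because `eps` and every row of `theta` sum to one, `vartheta` is already a valid distribution and can be passed to the sampler without renormalization.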

In Step 3, for a document $d$, we first generate the document's tags $t_l^d$ using a Bernoulli coin toss with prior probability $\eta_l$, as shown in step (a). After drawing $\varepsilon^d$, we generate $\vartheta^d$ from $\varepsilon^d$, $T^d$ and $\theta$. The remaining part of the generative process is the same as in LDA. As shown above, TWTM introduces a novel way to model the topic proportions of a semi-structured document using document-specific tags and text data. The key idea discussed in this paper is the tag-weighted topic assignment, by which $\vartheta^d$ is generated from $\varepsilon^d$, $T^d$ and $\theta$, and which provides an effective and direct method to infer the weights of the tags.

3.3 Tag-Weighted Topic Assignment 

Since we assume that all the observed tags in document $d$ contribute to the topic distribution $\vartheta^d$ of the document, we expect different tags to contribute according to their own weights. For example, in a blog application, a blog post has tags for its author, date, category and URL. Clearly, compared to the other tags, the author tag plays the most important role in constituting the topic components of the post.

The function that leverages the tag (or contextual) information to infer the topic distribution of a document is defined as follows:

$\vartheta^d \leftarrow f(t_1, \ldots, t_l),$

where $f(\cdot)$ is the way of making use of the tag information. Previous topic models that use tag or contextual information differ in their choice of $f(\cdot)$. In TWTM, we assume that $\vartheta^d$ is made up of all the observed tags with their own weights. Figure 1 shows how TWTM works as a probabilistic graphical model. As shown in Figure 1, $\vartheta^d$ is controlled by two factors: the topic distributions over tags, $\theta$, and the weights of the given tags of document $d$. It is important to distinguish TWTM from the Author-Topic Model [31]. In the author-topic model, each word $w$ is chosen from the distribution of only one of the given tags, while in TWTM all the observed tags in the document contribute to each word $w$.

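The $f(\cdot)$ in the proposed model is assumed, for document $d$, to be

$f(\vartheta^d) = (\varepsilon^d)^T \times T^d \times \theta,$

where the linear multiplication of $(\varepsilon^d)^T$, $T^d$ and $\theta$ maintains the condition $\sum_{k=1}^{K} \vartheta_k^d = 1$ without normalization of $\vartheta^d$, since $\varepsilon^d$ and $\theta$ satisfy

$\sum_{i=1}^{l^d} \varepsilon_i^d = 1, \qquad \sum_{k=1}^{K} \theta_{ik} = 1.$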
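The normalization claim can be checked in one line. The following worked step is a sketch that uses only the two constraints just stated; it writes $\Theta^d = T^d \times \theta$ for the rows of $\theta$ selected by the tags of document $d$, each of which sums to one.

```latex
\sum_{k=1}^{K} \vartheta^d_k
  = \sum_{k=1}^{K} \sum_{i=1}^{l^d} \varepsilon^d_i \, \Theta^d_{ik}
  = \sum_{i=1}^{l^d} \varepsilon^d_i \sum_{k=1}^{K} \Theta^d_{ik}
  = \sum_{i=1}^{l^d} \varepsilon^d_i
  = 1 .
```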

First, we pick out the topic distributions of the given tags of document $d$ from $\theta$ by $T^d \times \theta$, where $T^d$ is an $l^d \times L$ matrix and $\theta$ is an $L \times K$ matrix. Here we define

$\Theta^d = T^d \times \theta,$

where $\Theta^d$ is an $l^d \times K$ topic distribution matrix of the given tags in $d$, a sub-matrix of $\theta$. Second, $\varepsilon^d$ is the weight vector of the observed tags in $d$, and each dimension of $\varepsilon^d$ represents the weight or importance of the corresponding tag. Thus, $\vartheta^d$ is a mixture of the rows of $\Theta^d$ with the corresponding weight values:

$\vartheta^d = \left( \sum_{i=1}^{l^d} \varepsilon_i^d \Theta^d_{i1}, \; \sum_{i=1}^{l^d} \varepsilon_i^d \Theta^d_{i2}, \; \ldots, \; \sum_{i=1}^{l^d} \varepsilon_i^d \Theta^d_{iK} \right).$
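A small numeric example makes the mixing concrete. The sketch below assumes a document with two tags (say, an author tag and a date tag) and three topics; the numbers and the function name `mix_topics` are invented for illustration.

```python
def mix_topics(eps, Theta_d):
    """Return vartheta^d: the eps-weighted mixture of the rows of Theta^d."""
    num_topics = len(Theta_d[0])
    return [sum(e * row[k] for e, row in zip(eps, Theta_d))
            for k in range(num_topics)]


Theta_d = [[0.7, 0.2, 0.1],   # topic distribution of the author tag (made up)
           [0.1, 0.1, 0.8]]   # topic distribution of the date tag (made up)
eps = [0.9, 0.1]              # the author tag carries most of the weight
vartheta = mix_topics(eps, Theta_d)   # approximately [0.64, 0.19, 0.17]
```

With most of the weight on the author tag, the document's topic proportions stay close to the author's row, which is the behaviour the blog example above describes.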
Fig. 2. Graphical model representation of the variational distribution used to approximate the posterior in TWTM.

With $\vartheta^d$, TWTM generates all the words in document $d$ under the bag-of-words assumption.

Based on the above framework, we can define a specialized topic assignment function $f(\cdot)$ in an extended model for a real-world application.


3.4 Inference for TWTM 

In topic models, the key inferential problem is to compute the posterior distribution of the hidden variables given a document $d$. Given document $d$, the posterior distribution of the latent variables in the proposed model is

$p(\varepsilon^d, z \mid w^d, T^d, \theta, \eta, \psi, \pi) = \dfrac{p(\varepsilon^d, z, w^d, T^d \mid \theta, \eta, \psi, \pi)}{p(w^d, T^d \mid \theta, \eta, \psi, \pi)}.$ (1)

In Eq. (1), integrating over $\varepsilon^d$ and summing out $z$, we obtain the marginal distribution of $d$:

$p(w^d, T^d \mid \eta, \theta, \psi, \pi) = p(t^d \mid \eta) \int p(\varepsilon^d \mid T^d \times \pi) \prod_{i=1}^{N} \sum_{z_i^d=1}^{K} p(z_i^d \mid (\varepsilon^d)^T \times T^d \times \theta) \, p(w_i^d \mid z_i^d, \psi_{1:K}) \, d\varepsilon^d.$

In this work, we make use of a mean-field variational EM algorithm [4] to efficiently obtain an approximation of this posterior distribution of the latent variables. In mean-field variational inference, we minimize the KL divergence between the variational posterior and the true posterior by maximizing the evidence lower bound (ELBO) $\mathcal{L}(\cdot)$ [8]. For a single document $d$, we obtain $\mathcal{L}(\cdot)$ using Jensen's inequality:

$\mathcal{L}(\xi_{1:l^d}, \gamma_{1:K}; \eta_{1:L}, \pi_{1:L}, \theta_{1:L}, \psi_{1:K})$
$\quad = E[\log p(T_{1:l^d} \mid \eta_{1:L})] + E[\log p(\varepsilon^d \mid T^d \times \pi)]$
$\quad + \sum_{i=1}^{N} E[\log p(z_i \mid (\varepsilon^d)^T \times T^d \times \theta)]$
$\quad + \sum_{i=1}^{N} E[\log p(w_i \mid z_i, \psi_{1:K})] + H(q),$

where $\xi$ is an $l^d$-dimensional Dirichlet parameter vector and $\gamma$ is a $1 \times K$ vector, both of which are variational parameters of the variational distribution shown in Figure 2, and $H(q)$ indicates the entropy of the variational distribution:

$H(q) = -E[\log q(\varepsilon^d)] - E[\log q(z)].$

Here the expectation is taken with respect to a variational distribution $q(\varepsilon^d, z_{1:N})$, and we choose the following fully factorized distribution:

$q(\varepsilon^d, z_{1:N} \mid \xi_{1:l^d}, \gamma_{1:K}) = q(\varepsilon^d \mid \xi) \prod_{i=1}^{N} q(z_i \mid \gamma_i).$

The dimension of the parameter $\xi$ varies across documents. With the tag-weighted topic assignment used in TWTM, the expected log probability of a topic assignment can be difficult to compute.
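One way to see where the difficulty lies, and how it can be sidestepped, is sketched below. This is an assumption that is consistent with the form of the update in Eq. (3), not a step quoted from the paper: the topic probability is a mixture $\sum_j \varepsilon^d_j \Theta^d_{jk}$ inside a logarithm, and because the logarithm is concave and $\varepsilon^d$ lies on the simplex, Jensen's inequality moves the logarithm inside the mixture. Only the Dirichlet mean $E_q[\varepsilon^d_j] = \xi_j / \sum_{j'} \xi_{j'}$ is then required.

```latex
\mathbb{E}_q\!\left[ \log \sum_{j=1}^{l^d} \varepsilon^d_j \, \Theta^d_{jk} \right]
  \;\ge\; \mathbb{E}_q\!\left[ \sum_{j=1}^{l^d} \varepsilon^d_j \log \Theta^d_{jk} \right]
  \;=\; \sum_{j=1}^{l^d} \frac{\xi_j}{\sum_{j'=1}^{l^d} \xi_{j'}} \, \log \Theta^d_{jk} .
```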

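We then maximize the lower bound $\mathcal{L}(\cdot)$ with respect to the variational parameters $\xi$ and $\gamma$, using a variational expectation-maximization (EM) procedure as follows.

3.4.1 Variational E-step

We first maximize $\mathcal{L}(\cdot)$ with respect to $\xi$ for document $d$. The terms that contain $\xi$ are:

$\mathcal{L}_{[\xi]} = \sum_{i=1}^{l^d} \Big( \sum_{i'=1}^{L} \pi_{i'} T^d_{ii'} - 1 \Big) \Big( \Psi(\xi_i) - \Psi\Big( \sum_{j'=1}^{l^d} \xi_{j'} \Big) \Big)$
$\quad + \sum_{i=1}^{N} \sum_{k=1}^{K} \gamma_{ik} \cdot \sum_{j=1}^{l^d} \log \Theta^d_{jk} \, \dfrac{\xi_j}{\sum_{j'=1}^{l^d} \xi_{j'}}$
$\quad - \log \Gamma\Big( \sum_{i=1}^{l^d} \xi_i \Big) + \sum_{i=1}^{l^d} \log \Gamma(\xi_i)$
$\quad - \sum_{i=1}^{l^d} (\xi_i - 1) \Big( \Psi(\xi_i) - \Psi\Big( \sum_{j'=1}^{l^d} \xi_{j'} \Big) \Big),$ (2)

where $\Psi(\cdot)$ denotes the digamma function, the first derivative of the log of the Gamma function. Here we use a gradient descent method to find the $\xi$ that maximizes $\mathcal{L}_{[\xi]}$.

Next, we maximize $\mathcal{L}(\cdot)$ with respect to $\gamma_{ik}$. Adding Lagrange multipliers to the terms that contain $\gamma_{ik}$, taking the derivative with respect to $\gamma_{ik}$ and setting it to zero, we obtain the update equation for $\gamma_{ik}$:

$\gamma_{ik} \propto \psi_{k, v^{w_i}} \exp\Big\{ \sum_{j=1}^{l^d} \log \Theta^d_{jk} \, \dfrac{\xi_j}{\sum_{j'=1}^{l^d} \xi_{j'}} \Big\},$ (3)

where $v^{w_i}$ denotes the index of $w_i$ in the dictionary.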
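The update for $\gamma$ can be sketched as follows, reading Eq. (3) as $\gamma_{ik} \propto \psi_{k,v^{w_i}} \exp\{\sum_j \log\Theta^d_{jk}\, \xi_j / \sum_{j'} \xi_{j'}\}$. The function name `update_gamma` and the list-of-lists data layout are assumptions for illustration; words are given as 0-based dictionary indices.

```python
import math


def update_gamma(word_ids, psi, Theta_d, xi):
    """Compute gamma[i][k] for every word i of one document and every topic k."""
    xi_sum = sum(xi)
    num_topics = len(psi)
    # Exponent of Eq. (3): expected log topic weight under q(eps | xi).
    log_mix = [sum(math.log(Theta_d[j][k]) * xi[j] / xi_sum for j in range(len(xi)))
               for k in range(num_topics)]
    gamma = []
    for v in word_ids:
        unnorm = [psi[k][v] * math.exp(log_mix[k]) for k in range(num_topics)]
        total = sum(unnorm)
        gamma.append([u / total for u in unnorm])   # normalize over topics
    return gamma


# Toy check: one tag, two topics, vocabulary of two words (all values made up).
psi = [[0.5, 0.5], [0.9, 0.1]]
gamma = update_gamma([0, 1], psi, Theta_d=[[0.5, 0.5]], xi=[2.0])
```

The exponent depends only on the topic and not on the word, so it is computed once per document and reused for all $N$ words.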

In the E-step, we update $\xi$ and $\gamma$ for each document with the initialized model parameters. Because different documents have different numbers of tags, we have to keep the $\xi$ of every document for the M-step estimation.

3.4.2 M-step estimation 

The M-step needs to update four parameters: $\eta$, the tagging prior probability; $\pi$, the Dirichlet prior of the tags' weights; $\theta$, the topic distribution over all tags in the corpus; and $\psi$, the probability of a word under a topic. Because each document's tag set is observed, the Bernoulli prior $\eta$ is unused and is included only for model completeness. For a given corpus, $\eta_i$ is estimated by counting the number of times the $i$-th tag appears in the corpus.

For document $d$, the terms that involve the Dirichlet prior $\pi$ are:

$\mathcal{L}_{[\pi]} = \log \Gamma\Big( \sum_{i=1}^{l^d} (T^d \times \pi)_i \Big) - \sum_{i=1}^{l^d} \log \Gamma\big( (T^d \times \pi)_i \big)$
$\quad + \sum_{i=1}^{l^d} \big( (T^d \times \pi)_i - 1 \big) \Big( \Psi(\xi_i) - \Psi\Big( \sum_{j=1}^{l^d} \xi_j \Big) \Big),$ (4)

where $(T^d \times \pi)_i = \sum_{i'=1}^{L} \pi_{i'} T^d_{ii'}$. We use a gradient descent method, taking the derivative of Eq. (4) with respect to $\pi_i$ over the corpus, to find the estimate of $\pi$.

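To maximize with respect to $\theta$ and $\psi$, we obtain the following update equations:

$\theta_{lk} \propto \sum_{d=1}^{D} \sum_{i=1}^{N} \gamma^d_{ik} \sum_{j=1}^{l^d} T^d_{jl} \, \dfrac{\xi^d_j}{\sum_{j'=1}^{l^d} \xi^d_{j'}}$ (5)

and

$\psi_{kv} \propto \sum_{d=1}^{D} \sum_{i=1}^{N} \gamma^d_{ik} \, \mathbb{1}[v^{w_i} = v],$ (6)

where $\mathbb{1}[v^{w_i} = v]$ equals 1 if the $i$-th word of document $d$ is the $v$-th word of the dictionary and 0 otherwise.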

We provide a detailed derivation of the variational EM algorithm for TWTM in Appendix A, and show the variational expectation-maximization (EM) procedure of TWTM in Algorithm 1.

Algorithm 1 The variational expectation maximization (EM) algorithm of TWTM

1: Input: a semi-structured corpus with $V$ unique words and $L$ unique tags, and the expected number $K$ of topics.
2: Output: topic-word distributions $\psi$, tag-topic distributions $\theta$, $\pi$, and the topic distribution $\vartheta^d$ and weight vector $\varepsilon^d$ of each training document.
3: initialize $\pi$, and initialize $\theta$ and $\psi$ subject to the constraints $\sum_{k=1}^{K} \theta_{lk} = 1$ and $\sum_{v=1}^{V} \psi_{kv} = 1$.
4: repeat
5:   for each document $d$ do
6:     update $\xi$ with Eq. (2) using the gradient descent method.
7:     update $\gamma_{ik}$ with Eq. (3).
8:   end for
9:   update $\pi$ with Eq. (4) using the gradient descent method.
10:  update $\theta$ by Eq. (5).
11:  update $\psi$ by Eq. (6).
12: until convergence


4 Tag-Weighted Dirichlet Allocation 

In a real-world application, a corpus is very likely to contain both semi-structured and unstructured documents. Many documents in the corpus have no tags, only unstructured text data. TWTM does not work in this case, because it generates the topic distribution of a document by leveraging the weights among the observed tags. Our proposed solution is to add a Dirichlet prior to the topic distribution $\vartheta^d$, which means that we learn the weights among the Dirichlet prior and the given tags, not just among the tags. We call this solution Tag-Weighted Dirichlet Allocation (TWDA). When handling the unstructured documents in a hybrid corpus, TWDA degenerates into LDA [9], which simply draws the topic proportions of a document from a Dirichlet distribution.

Fig. 3. Graphical model representation for TWDA, where $\mu$ is a Dirichlet prior of $\lambda$.

As an extension of TWTM, TWDA uses the same parameter notation. Unlike in TWTM, for convenience of inference in TWDA, $t^d$ is expanded to an $l^d \times (L+1)$ matrix $T^d$, where $l^d$ is one more than the number of given tags in document $d$ (for example, if document $d$ has five tags, $l^d$ is six). For each row index $i \in \{1, \ldots, l^d\}$ in $T^d$, the row $T^d_{i\cdot}$ is a binary vector, where $T^d_{ij} = 1$ if and only if the $i$-th tag of document $d$ is the $j$-th tag of the tag set of the corpus $D$. Note that, for all documents, we set the last dimension of the last row of $T^d$ to 1 and the other dimensions of the last row to 0. The reason for this setting is explained below.
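The augmented matrix can be built with a few lines of code. The sketch below assumes 0-based tag ids and a hypothetical function name `build_twda_tag_matrix`; the extra last column and last row stand for the latent tag that every document carries.

```python
def build_twda_tag_matrix(doc_tags, num_tags):
    """Build the l_d x (L + 1) matrix T^d of TWDA, with l_d = number of tags + 1."""
    width = num_tags + 1
    rows = []
    for tag in sorted(doc_tags):
        row = [0] * width
        row[tag] = 1            # one-hot row for an observed tag
        rows.append(row)
    latent = [0] * width
    latent[-1] = 1              # last row: only the last dimension is 1
    rows.append(latent)
    return rows


tagged = build_twda_tag_matrix([2, 0], num_tags=3)    # two tags -> three rows
untagged = build_twda_tag_matrix([], num_tags=3)      # no tags -> latent row only
```

For an untagged document the matrix reduces to the single latent row, which is what allows the same machinery to handle unstructured text.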

TWDA defines a Dirichlet prior $\mu$ over a latent topic distribution of a document, and mixes this latent topic proportion with the topic distributions of the given tags according to their importance or weight (tag-weighted) to form the final topic distribution of the document. Figure 3 shows the graphical model representation of TWDA, and the generative process for TWDA is given by the following procedure:

1) For each topic $k \in \{1, \ldots, K\}$, draw $\psi_k \sim \mathrm{Dir}(\beta)$, where $\beta$ is a $V$-dimensional prior vector of $\psi$.

2) For each tag $t \in \{1, \ldots, L\}$, draw $\theta_t \sim \mathrm{Dir}(\alpha)$, where $\alpha$ is a $K$-dimensional prior vector of $\theta$.

3) For each document $d$:

a) Draw $\lambda \sim \mathrm{Dir}(\mu)$.

b) Generate $T^d$ by $t^d$.

c) Draw $\varepsilon^d \sim \mathrm{Dir}(T^d \times \pi)$.

d) Generate $\vartheta^d = (\varepsilon^d)^T \times T^d \times \begin{pmatrix} \theta \\ \lambda \end{pmatrix}$.

e) For each word $w_{di}$:

i) Draw $z_{di} \sim \mathrm{Mult}(\vartheta^d)$.

ii) Draw $w_{di} \sim \mathrm{Mult}(\psi_{z_{di}})$.

Note that $L$ is the number of tags that appear in the corpus and $K$ is the number of topics. Unlike in TWTM, here $\pi$ is an $(L+1) \times 1$ column vector and $\mu$ is a $K \times 1$ column vector. Both of them are Dirichlet priors. $\lambda$ is a $1 \times K$ row vector drawn from $\mu$. $(\varepsilon^d)^T$ is the transpose of $\varepsilon^d$, and $\varepsilon^d$ is drawn from a Dirichlet prior obtained by the matrix multiplication $T^d \times \pi$. Clearly, the result of $T^d \times \pi$ is an $l^d \times 1$ vector whose dimension depends on the number of observed tags in document $d$. Note that $l^d$ is one more than the number of tags given in $d$, as described above.

In other words, we treat $\lambda$ as the topic distribution of one latent tag, drawn from the Dirichlet prior $\mu$. Each document is controlled by a latent tag, which is the idea shared by TWDA and Latent Dirichlet Allocation (LDA). $\begin{pmatrix} \theta \\ \lambda \end{pmatrix}$ is the augmented matrix of $\theta$ and $\lambda$: we append the vector $\lambda$ to the matrix $\theta$ as the last row, so that it becomes an $(L+1) \times K$ matrix. As shown above, $T^d$ is the matrix form of the given tags in document $d$, and the last row of $T^d$ is a binary vector in which only the last dimension equals 1 and the others equal 0. Here we define

$\Theta^d = T^d \times \begin{pmatrix} \theta \\ \lambda \end{pmatrix}.$

Clearly, $\Theta^d$ is an $l^d \times K$ matrix whose last row is $\lambda$. The purpose of $\Theta^d$ is to pick out, from the tag-topic distribution matrix $\theta$, the rows corresponding to the tags that appear in $d$.

The key idea of tag-weighted Dirichlet allocation is to model the topic proportions of semi-structured documents using document-specific tags and text data. Unlike in LDA, the topic proportion of a document in this model is controlled not only by a Dirichlet prior $\mu$, but also by all the observed tags. To generate the normalized topic distribution of document $d$, we mix the Dirichlet allocation and the tag information through a weight vector $\varepsilon^d$. Thus, we use the topic assignment function $f(\cdot)$ to obtain the topic distribution of $d$ by

$f(\vartheta^d) = (\varepsilon^d)^T \times T^d \times \begin{pmatrix} \theta \\ \lambda \end{pmatrix}.$

It is worth noting that $\varepsilon^d$ is drawn from a Dirichlet prior $\pi$, each row of $\theta$ is drawn from a Dirichlet prior $\alpha$, and $\lambda$ is drawn from a Dirichlet prior $\mu$, so $\varepsilon^d$, $\theta$ and $\lambda$ satisfy

$\sum_{i=1}^{l^d} \varepsilon_i^d = 1, \qquad \sum_{k=1}^{K} \theta_{ik} = 1, \qquad \text{and} \qquad \sum_{k=1}^{K} \lambda_k = 1.$

Therefore, the linear multiplication of $(\varepsilon^d)^T$, $T^d$, $\theta$ and $\lambda$ maintains the condition $\sum_{k=1}^{K} \vartheta_k^d = 1$ without normalization of $\vartheta^d$. With $\vartheta^d$, the topic proportions of document $d$, the remaining part of the generative process is the same as in LDA.
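The following sketch illustrates the TWDA mixture and its behaviour on an untagged document. The function name `twda_topic_proportions` and the toy values are assumptions. For an untagged document the weight vector has a single component, and a one-dimensional Dirichlet draw is always 1, so the topic proportions reduce to the latent distribution $\lambda$, as in LDA.

```python
def twda_topic_proportions(eps, doc_tags, theta, lam):
    """Mix the tag rows of theta and the latent row lam with weights eps.

    eps has one weight per observed tag plus a final weight for the latent tag.
    """
    rows = [theta[t] for t in doc_tags] + [lam]   # rows of Theta^d, lam last
    num_topics = len(lam)
    return [sum(e * row[k] for e, row in zip(eps, rows))
            for k in range(num_topics)]


theta = [[0.8, 0.2], [0.3, 0.7]]      # tag-topic rows (made up)
lam = [0.5, 0.5]                      # latent topic distribution of the document
mixed = twda_topic_proportions([0.6, 0.4], [0], theta, lam)
plain = twda_topic_proportions([1.0], [], theta, lam)   # untagged document
```

Here `plain` equals `lam`, while `mixed` lies between the tag's row and `lam` according to the weights.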


Fig. 4. Graphical model representation of the variational distribution used to approximate the posterior in TWDA.


4.1 Inference for TWDA 

In TWDA, we treat $\pi$, $\mu$, $\eta$, $\theta$ and $\psi$ as unknown constants to be estimated. As in TWTM, the marginal distribution of $d$, given below, is not efficiently computable:

$p(w^d, T^d \mid \eta, \theta, \psi, \pi, \mu) = p(t^d \mid \eta) \int p(\varepsilon^d \mid T^d \times \pi) \cdot p(\lambda \mid \mu) \prod_{i=1}^{N} \sum_{z_i^d=1}^{K} p\Big( z_i^d \,\Big|\, (\varepsilon^d)^T \times T^d \times \begin{pmatrix} \theta \\ \lambda \end{pmatrix} \Big) \cdot p(w_i^d \mid z_i^d, \psi_{1:K}) \, d\varepsilon^d.$

In this case, we again use a variational expectation-maximization (EM) procedure to carry out approximate maximum likelihood estimation for TWDA.

4.1.1 Variational inference 

In TWDA, we use the following fully factorized distribution, as shown in Figure 4:

$$q(\varepsilon^d, \lambda^d, z_{1:N} \mid \xi_{1:L}, \rho_{1:K}, \gamma_{1:K}) = q(\varepsilon^d \mid \xi)\, q(\lambda^d \mid \rho) \prod_{i=1}^{N} q(z_i \mid \gamma_i),$$

and the entropy of the variational distribution is

$$H(q) = -E[\log q(\varepsilon^d)] - E[\log q(\lambda)] - E[\log q(z)].$$

For the variational parameter $\xi$, we collect the terms of the evidence lower bound (ELBO) $\mathcal{L}(\cdot)$ of TWDA that contain $\xi$ to form $\mathcal{L}_{[\xi]}$, and we use a gradient method to find the $\xi$ that maximizes $\mathcal{L}_{[\xi]}$:

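$$\begin{aligned}
\mathcal{L}_{[\xi]} ={}& \sum_{i=1}^{l^d} \Big( \sum_{i'=1}^{L+1} \pi_{i'} T^d_{ii'} - 1 \Big) \Big( \Psi(\xi_i) - \Psi\Big( \sum_{j'=1}^{l^d} \xi_{j'} \Big) \Big) \\
&+ \sum_{i=1}^{N} \sum_{k=1}^{K} \gamma_{ik} \sum_{j=1}^{l^d} \tau_k^{(j)} \frac{\xi_j}{\sum_{j'=1}^{l^d} \xi_{j'}} \\
&- \log \Gamma\Big( \sum_{i=1}^{l^d} \xi_i \Big) + \sum_{i=1}^{l^d} \log \Gamma(\xi_i) \\
&- \sum_{i=1}^{l^d} (\xi_i - 1) \Big( \Psi(\xi_i) - \Psi\Big( \sum_{i'=1}^{l^d} \xi_{i'} \Big) \Big),
\end{aligned} \tag{7}$$

where

$$\tau_k^{(j)} = \begin{cases} \log \theta_k^{(j)}, & j \in \{1, \ldots, l^d - 1\}, \\ \Psi(\rho_k) - \Psi\big( \sum_{j'=1}^{K} \rho_{j'} \big), & j = l^d, \end{cases}$$

and $\Psi(\cdot)$ denotes the digamma function, the first derivative of the log of the Gamma function.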











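In particular, by computing the derivatives of $\mathcal{L}(\cdot)$ and setting them equal to zero, we obtain the following pair of update equations for the variational parameters $\rho^d$ and $\gamma_{ik}$:

$$\rho_k^d = \mu_k + \sum_{n=1}^{N} \gamma_{nk} \frac{\xi_{l^d}}{\sum_{j=1}^{l^d} \xi_j}, \tag{8}$$

$$\gamma_{ik} \propto \psi_{k, v^{w_i}} \exp\Big\{ \sum_{j=1}^{l^d - 1} \log \theta_k^{(j)} \frac{\xi_j}{\sum_{j'=1}^{l^d} \xi_{j'}} + \Big( \Psi(\rho_k) - \Psi\Big( \sum_{k'=1}^{K} \rho_{k'} \Big) \Big) \frac{\xi_{l^d}}{\sum_{j'=1}^{l^d} \xi_{j'}} \Big\}, \tag{9}$$

where $v^{w_i}$ denotes the index of $w_i$ in the dictionary.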

In the E-step, we update the variational parameters $\xi$, $\rho$ and $\gamma$ for each document, given the current model parameters. The detailed derivation of the variational parameters for TWDA is given in Appendix B.

4.1.2 Model Parameter Estimation 
There are four model parameters to estimate in the M-step: $\pi$, the Dirichlet prior of the tags' weights; $\theta$, the topic distribution over all tags in the corpus; $\psi$, the probability of a word under a topic; and $\mu$, a Dirichlet prior of the model. In TWDA, we can estimate $\pi$, $\theta$ and $\psi$ in the same way as in TWTM.

Unlike TWTM, TWDA has an extra Dirichlet prior $\mu$. The terms involving $\mu$ are:

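$$\mathcal{L}_{[\mu]} = \sum_{d=1}^{D} \Big[ \log \Gamma\Big( \sum_{j=1}^{K} \mu_j \Big) - \sum_{i=1}^{K} \log \Gamma(\mu_i) + \sum_{i=1}^{K} (\mu_i - 1) \Big( \Psi(\rho_i^d) - \Psi\Big( \sum_{j=1}^{K} \rho_j^d \Big) \Big) \Big]. \tag{10}$$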

We can invoke the linear-time Newton-Raphson algorithm to estimate $\mu$, in the same way as for the Dirichlet parameter described in LDA [9].

In the variational expectation-maximization (EM) procedure of TWDA, we update the variational parameters $\xi$, $\rho$ and $\gamma$ with Eqs. (7), (8) and (9), respectively, in the E-step. In the M-step, besides updating $\pi$, $\theta$ and $\psi$, we also update $\mu$ with Eq. (10) by the Newton-Raphson algorithm. The detailed derivation of the model parameter estimation in TWDA is shown in Appendix B.
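The Newton-Raphson update for a Dirichlet prior can be sketched as follows. This is a minimal illustration, assuming the objective is the sum over documents of $\log\Gamma(\sum_j \mu_j) - \sum_i \log\Gamma(\mu_i) + \sum_i (\mu_i - 1)(\Psi(\rho_i^d) - \Psi(\sum_j \rho_j^d))$ and using the diagonal-plus-rank-one Hessian structure known from LDA, which is what makes each step linear in $K$. The helper names (`digamma`, `trigamma`, `estimate_mu`), the series approximations, and the step-halving guard that keeps $\mu$ positive are assumptions of this sketch, not taken from the paper.

```python
import math


def digamma(x):
    """Digamma via psi(x) = psi(x + 1) - 1/x and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))


def trigamma(x):
    """Trigamma via psi'(x) = psi'(x + 1) + 1/x^2 and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + 1.0 / x + 0.5 * inv2 + (inv2 / x) * (1 / 6 - inv2 * (1 / 30 - inv2 / 42))


def estimate_mu(rho_per_doc, mu_init, max_iter=100, tol=1e-8):
    """Newton-Raphson estimate of the Dirichlet prior mu from per-document rho."""
    D, K = len(rho_per_doc), len(mu_init)
    # Sufficient statistics: sum over documents of E_q[log lambda_k].
    ss = [0.0] * K
    for rho in rho_per_doc:
        total = digamma(sum(rho))
        for k in range(K):
            ss[k] += digamma(rho[k]) - total
    mu = list(mu_init)
    for _ in range(max_iter):
        psi_sum = digamma(sum(mu))
        g = [D * (psi_sum - digamma(mu[k])) + ss[k] for k in range(K)]  # gradient
        h = [-D * trigamma(mu[k]) for k in range(K)]                    # Hessian diagonal
        z = D * trigamma(sum(mu))                                       # rank-one part
        c = sum(g[k] / h[k] for k in range(K)) / (1.0 / z + sum(1.0 / h[k] for k in range(K)))
        step = [(g[k] - c) / h[k] for k in range(K)]                    # H^{-1} g in O(K)
        scale = 1.0
        while any(mu[k] - scale * step[k] <= 0 for k in range(K)):
            scale *= 0.5                                                # keep mu positive
        mu = [mu[k] - scale * step[k] for k in range(K)]
        if max(abs(scale * s) for s in step) < tol:
            break
    return mu
```

If every document has the same $\rho$, the maximizer is $\mu = \rho$, which gives a simple sanity check for the routine.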

5 Analysis of TWDA 

In TWDA, we introduce a better way to model semi-structured and unstructured documents directly by adding a latent tag to each document, so that the topic distribution of a document is controlled by the observed tags and one latent tag. In LDA, the topic distribution of a document is drawn from a Dirichlet distribution governed by a hyperparameter, without considering the given tags, while in TWTM the topic distribution is controlled by a list of given tags with corresponding weight values. The main difference among the models that handle unstructured text (e.g., LDA and CTM [7]) or semi-structured documents (e.g., ATM, Labeled LDA [29], DMR [26] and PLDA [30]) is the function used to generate the topic distribution of a document or, in other words, the assumption about what distribution the topics of a document follow.

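In TWDA, the topic proportions $\vartheta^d$ for a document $d$ are obtained by the following function:

$$\vartheta^d = (\varepsilon^d)^T \times T^d \times \begin{pmatrix} \theta \\ \lambda \end{pmatrix}.$$

When we ignore the tags in a document, $T^d$ in the equation above becomes a binary row vector whose last dimension equals 1 and whose other dimensions are 0. In this case, the weight vector $\varepsilon^d$ has a single component, which must equal 1, and the product $T^d \times \binom{\theta}{\lambda}$ reduces to $\lambda$:

$$\vartheta^d = (\varepsilon^d)^T \times T^d \times \begin{pmatrix} \theta \\ \lambda \end{pmatrix} = \lambda.$$

The topic distribution of $d$ is thus simplified to $\lambda$ and, as shown above, $\lambda$ is drawn from a Dirichlet prior $\mu$. This means that the topic proportions for the document $d$ are a draw from a Dirichlet distribution, which is the basic assumption of LDA [9]. In other words, when handling unstructured documents, TWDA degenerates into LDA.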

In summary, the topic distribution of a document in TWTM is the weighted average of the topic distributions of the given tags; to some extent, this is a linear relation between the topic distribution of a document and its tags. In TWDA, by contrast, the addition of the Dirichlet prior $\mu$, which is equivalent to generating a latent tag with its own topic distribution for each document, makes topic generation in each document a non-linear procedure.

6 Large Scale Solutions 

Many web applications now produce large-scale collections of tagged documents, which highlights the problem of modeling large-scale semi-structured documents in many areas. In this paper, we propose and compare three distributed methods for TWTM in the MapReduce programming framework, which address the challenge of working at a large scale.

Solution I 

The first solution is a tailored parallel algorithm for TWTM. Since learning and inference in the proposed model are based on a variational method with an EM algorithm, we design a parallel algorithm for TWTM using the MapReduce programming framework.

As shown above, we need to update the global parameters $\pi$, $\theta$ and $\psi$ for a corpus. Every document is associated with its own variational parameters $\xi$ and $\gamma$. The mapper computes these variational parameters for each document and uses them to generate the sufficient statistics for updating $\pi$, $\theta$ and $\psi$. The reducer then aggregates these statistics to update the global parameters $\pi$, $\theta$ and $\psi$.

1) Mapper: For each document $d$, we compute $\gamma^d$ and $\xi^d$ using their respective update equations. The sufficient statistics are kept for each document.

2) Reducer: The Reduce function adds the values to the global parameters $\theta$ and $\psi$ using the sufficient statistics, following their update equations.

3) Driver: The driver program marshals the entire inference process. At the beginning, the driver initializes all the model parameters $K$, $L$, $\theta$, $\psi$ and $\pi$. The number of topics $K$ is specified by the user; the number of tags $L$ is determined by the data; the initial value of $\pi$ is given by the user; $\theta$ and $\psi$ are randomly initialized. After each MapReduce iteration, the driver normalizes the global $\theta$ and $\psi$.

Note that, because $\pi$ is a global parameter over the corpus, we have to update $\pi$ at the end of each iteration in the driver. However, computing $\pi$ requires large-scale data migration, since $\pi$ is associated with every document and different documents have different tags, which affect different dimensions of $\pi$. The whole corpus data would have to migrate to the single driver node, which can create a bottleneck in the driver.
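The division of work among mapper, reducer and driver can be sketched as a single-process Python simulation. This is not Hadoop code: the record layout (`words`, `tags`, `gamma`, `xi`), the key scheme and the function names are hypothetical, and the per-document variational parameters are assumed to be already computed. The statistics follow the proportionalities $\psi_{kj} \propto \sum \gamma_{ik}$ and $\theta_{lk} \propto \sum \gamma_{ik}\, \xi_l / \sum_j \xi_j$.

```python
from collections import defaultdict


def mapper(doc):
    """Emit (key, value) sufficient statistics for one document."""
    xi_sum = sum(doc["xi"])
    for i, word in enumerate(doc["words"]):
        for k, g in enumerate(doc["gamma"][i]):
            yield ("psi", k, word), g  # topic-word statistic
            for j, tag in enumerate(doc["tags"]):
                # tag-topic statistic, weighted by the tag's expected weight
                yield ("theta", tag, k), g * doc["xi"][j] / xi_sum


def reducer(pairs):
    """Sum the values emitted under each key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return totals


def normalize_rows(totals, name):
    """Driver step: normalize each row of the named parameter to sum to one."""
    row_sums = defaultdict(float)
    for (param, row, _col), value in totals.items():
        if param == name:
            row_sums[row] += value
    return {(row, col): value / row_sums[row]
            for (param, row, col), value in totals.items() if param == name}


docs = [
    {"words": [0, 1], "tags": [0], "gamma": [[0.9, 0.1], [0.2, 0.8]], "xi": [1.0]},
    {"words": [1], "tags": [0, 1], "gamma": [[0.5, 0.5]], "xi": [1.0, 3.0]},
]
totals = reducer(kv for d in docs for kv in mapper(d))
psi = normalize_rows(totals, "psi")      # keys are (topic, word)
theta = normalize_rows(totals, "theta")  # keys are (tag, topic)
```

The update of $\pi$ is deliberately left out: it needs per-document quantities for every tag, which is the data migration described above.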

Solution II 

Because of the bottleneck in Solution I, Solution II optimizes the calculation of $\pi$ through an approximate method. The MapReduce procedure of Solution II is as follows.

1) Mapper: For each document $d$, we compute $\gamma^d$ and $\xi^d$ by their update equations, together with the sufficient statistics for updating $\theta$ and $\psi$. Unlike in Solution I, we obtain a $\pi_s$ for each map data split $s$.

2) Reducer: The Reduce function adds the values to the global parameters $\theta$ and $\psi$ using the sufficient statistics, following their update equations.

3) Driver: In the driver function, we only need to compute the average of the $\pi_s$, $s \in \{1, \ldots, S\}$, where $S$ is the total number of mappers in the cluster. The driver also normalizes the global $\theta$ and $\psi$ for the next iteration.

Solution II is an approximate solution of TWTM, which computes $\pi_s$ for each mapper and takes their average as the estimate of $\pi$ to avoid large-scale data migration.
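The driver step of Solution II reduces to an element-wise average, as in the following minimal sketch. The function name `average_pi` and the plain-list representation of each $\pi_s$ are assumptions for illustration; the sketch uses an unweighted mean, as the text describes, and does not weight splits by their size.

```python
def average_pi(pi_per_split):
    """Average the per-split estimates pi_s element-wise over the S splits."""
    S = len(pi_per_split)
    L = len(pi_per_split[0])
    return [sum(pi_s[l] for pi_s in pi_per_split) / S for l in range(L)]


pi = average_pi([[1.0, 2.0], [3.0, 4.0]])
```

Only $S$ vectors of length $L$ reach the driver, instead of per-document data for the whole corpus.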

Solution III 

As the update equation for $\pi$ shows, $\pi_l$, $l \in \{1, \ldots, L\}$, is only associated with the documents that contain the $l$-th tag. Thus, before running TWTM, we can group the documents into several clusters under the condition that documents sharing one or more tags must be in the same cluster. This means that the tags divide the documents into mutually independent subsets. We show the detailed clustering process in Appendix C. The MapReduce procedure of Solution III is as follows.

1) Mapper: The input of the mapper is the clusters. For each cluster $c$, $c \in \{1, \ldots, C\}$, where $C$ is the number of document clusters, we obtain a $\pi_c$ together with the sufficient statistics for updating $\theta$ and $\psi$.

2) Reducer: The Reduce function adds the values to the global parameters $\theta$ and $\psi$ using the sufficient statistics, following their update equations.

3) Driver: In the driver, we update $\theta$ and $\psi$. Note that there is no need to recompute $\pi$: we combine all the $\pi_c$ to obtain the final $\pi$ for the current iteration.

Solution III is an exact solution for TWTM, and it is equivalent to Solution I when all the documents belong to one cluster. Solution III can be more efficient than Solution I, but this depends on the result of document clustering, which may become another bottleneck in some real applications. Although Solution II is an approximate method for modeling semi-structured documents, it effectively avoids the bottlenecks of Solution I and Solution III. The experimental results in Section 7.6 show that Solution II is more efficient than Solution I and Solution III.
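The clustering condition of Solution III (documents that share a tag, directly or through a chain of other documents, end up in the same cluster) amounts to finding connected components. The sketch below uses a union-find structure over tags. It is only one possible reading of that condition: the function name `cluster_by_tags` and the treatment of untagged documents as singleton clusters are assumptions, and the procedure actually used in Appendix C may differ.

```python
def cluster_by_tags(doc_tags):
    """Group document indices so that documents sharing any tag are together."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Tags that co-occur in one document belong to the same component.
    for tags in doc_tags:
        tags = list(tags)
        for t in tags[1:]:
            union(tags[0], t)

    clusters = {}
    for d, tags in enumerate(doc_tags):
        tags = list(tags)
        # Assumption: an untagged document forms its own cluster.
        key = find(tags[0]) if tags else ("untagged", d)
        clusters.setdefault(key, []).append(d)
    return list(clusters.values())


clusters = cluster_by_tags([[1, 2], [2, 3], [4], [5, 4], []])
```

If one popular tag links most documents, nearly the whole corpus falls into a single cluster, which illustrates why the clustering result can itself become a bottleneck.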

It is worth noting that all the solutions iterate the MapReduce procedure in the driver function until convergence or until the maximum number of iterations is reached. In Section 7.6 we show the experimental results comparing the three solutions on document modeling and efficiency.

7 Experimental Analysis 
7.1 Experiment Settings 

In the experiments, we used three semi-structured corpora. The first document collection comes from the Internet Movie Database (IMDB). The data set includes 12,091 movie storylines, 52,274 words after removing stop words, and 3,654 tags. These movies belong to 29 genres, and the tags we used comprise directors, stars, time, and movie keywords. The second consists of technical papers from the Digital Bibliography and Library Project (DBLP) data set (footnote 3), which is a collection of bibliographic information on major computer science journals and proceedings. In this paper, we use a subset of DBLP that contains abstracts of D = 27,435 papers, with W = 70,062 words in the vocabulary and L = 6,256 unique tags. The tags we used in DBLP include authors and
3. http://www.informatik.uni-trier.de/~ley/db/ 
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(a) TWTM, TWDA, LDA and CTM (b) TWTM, TWDA, DMR, ATM, (c) TWTM, TWDA and PLDA 

CorrLDA, LDA and CTM 


Fig. 5. Perplexity results of different models on IMDB corpora. LDA and CTM only use the words when training 
in (a), and add the tags as the word features during the training process in (b). 


keywords. The last corpus contains about 967,012 Wordpress (footnote 4) blog posts from Kaggle (footnote 5). In this corpus, there are 163,504 tags and 2,592,562 words. We used this corpus to test the effectiveness and performance of TWTM on a large-scale dataset. We implemented the three distributed methods of TWTM using Hadoop 1.1.1 and ran all experiments on a cluster of 7 physical nodes; each node has 4 cores and 8 threads, and can be configured to run a maximum of 7 mapper and 7 reducer tasks. With this configuration, we built distributed environments of different scales by setting the maximum number of mappers used on each node.

We have released the code on GitHub (footnote 6), including TWTM, TWDA and the three distributed solutions using the Hadoop platform.


7.2 Results on Documents Modeling 

To evaluate the generalization capability of the model, we use the perplexity score. For a test set of $D$ documents, the perplexity is:

$$\text{perplexity} = \exp\Big\{ -\frac{\sum_{d=1}^{D} \log p(w_d)}{\sum_{d=1}^{D} N_d} \Big\},$$

where a lower perplexity score represents better document modeling performance.
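As a quick illustration of this definition, the following sketch computes perplexity from per-document log-likelihoods and document lengths. The function name `perplexity` and its two list arguments are assumptions made for the example; how $\log p(w_d)$ is obtained from a trained model is outside its scope.

```python
import math


def perplexity(log_likelihoods, doc_lengths):
    """exp of the negative total log-likelihood divided by the total word count."""
    return math.exp(-sum(log_likelihoods) / sum(doc_lengths))


# A model that assigns probability 1/10 to every word of a 5-word document.
value = perplexity([5 * math.log(0.1)], [5])
```

For a model that is uniform over 10 words, the perplexity is 10 up to floating-point error, which matches the usual reading of perplexity as an effective vocabulary size.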

The experiments have two parts. First, we trained four latent variable models, LDA [9], CTM [7], TWTM and TWDA, on a corpus of movie documents from IMDB to compare their generalization performance. In this part, LDA and CTM are trained on the text data without using the tag information. We removed the stop words and conducted experiments using 5-fold cross-validation. Figure 5(a) shows the perplexity results on the IMDB data set. TWTM and TWDA outperform both CTM and LDA significantly and consistently.


Second, to compare the performance of TWTM and TWDA with other topic models that use tag information, we trained TWTM, TWDA, DMR (footnote 7), PLDA (footnote 8), the Author Topic Model (ATM), CorrLDA [6], CTM, and LDA on the set of movie documents in IMDB and computed the perplexity on the test data set. Since CTM and LDA cannot easily handle a corpus with tags, in this experiment we treated the given tags as word features for them. In CorrLDA, we used the tags in each document to represent the image segments, so that CorrLDA can handle SSDs. Figure 5(b) shows the perplexity results of these models on the IMDB data. The results show that TWTM and TWDA are better than the other models: as the number of topics increases, CorrLDA, DMR, CTM and LDA run into over-fitting, while the perplexity of TWTM and TWDA keeps decreasing and is significantly lower than that of the baselines.

Since PLDA [30] assumes that one of the tags may optionally be a "latent" tag present on every document $d$, we trained PLDA, TWTM and TWDA with 1021 and 2041 topics on the IMDB data set with 1020 tags, because in PLDA each latent topic takes part in exactly one tag in a collection. As shown in [30], PLDA builds on Labeled LDA [29], and when it is set to use one latent topic and one topic for each tag, it is approximately equivalent to Labeled LDA. For this case, we trained PLDA with 1021 topics. Figure 5(c) shows the perplexity results of TWTM, TWDA and PLDA. Note that TWDA has a lower mean squared error (MSE) than TWTM. As the results in Figure 5 show, TWTM and TWDA both work well compared with the other topic models that make use of tag information.


7.3 Results on Tags prediction 

In this section, we use TWDA to demonstrate the
performance of our work on tag prediction by


4. http://wordpress.com

5. http://www.kaggle.com/c/predict-wordpress-likes/data

6. https://github.com/ShuangyinIi

7. We used the Mallet code (http://mallet.cs.umass.edu/).

8. We used the code in the Stanford Topic Modeling Toolbox (http://www-nlp.stanford.edu/software/tmt/tmt-0.4/).
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TABLE 1 

Classification results of different features on F1-score

F1-score       @1     @3     @5
TFIDF          0.5    0.41   0.39
LDA+TFIDF      0.5    0.42   0.39
TWDA           0.57   0.5    0.47
TWDA+TFIDF     0.58   0.5    0.47


(a) K=100 (b) K=200 

Fig. 6. Prediction results of TWDA, DMR and ATM for authors on the DBLP corpus. The number of topics is set to 100 in (a) and 200 in (b).


processing the paper collection in DBLP. In addition to predicting the tags given a document, we evaluate the ability of the proposed model, compared with ATM, DMR and CorrLDA, to predict the tags of a document conditioned on the words in the document. In this part, we treat the authors of each paper as the tags and the abstract as the word features, and we predict the authors of a paper by modeling the abstracts using ATM, DMR, CorrLDA, and TWDA. For each model, we evaluate the likelihood of the authors given the word features in a document, and rank each possible author by this likelihood. First, for each model, we obtain the topic distribution over a test document $d_{test}$ given one author $a$. Then, we evaluate $p(d_{test} \mid a)$ for $d_{test}$ over each author $a$ in the tag (author) set by

$$p(d_{test} \mid a) = \prod_{i=1}^{N} \Big( \sum_{z} p(z \mid a)\, p(w_i \mid z) \Big).$$


For CorrLDA, we let authors represent image regions, and used $p(d_{test} \mid region)$ as given in [6] to evaluate the likelihood of an author given a document. For DMR and ATM, $p(d_{test} \mid a)$ is defined as in [26]. Note that the likelihoods of a given author over a document are not necessarily comparable among the topic models; however, what we are interested in is the ranking, as in [26].
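The ranking procedure for $p(d_{test} \mid a)$ can be sketched as follows. The function names, the dictionary `author_topics` mapping each author to a topic distribution $p(z \mid a)$, and the nested-list topic-word matrix are assumptions for illustration. The sketch works in log space to avoid underflow on long documents, which does not change the ranking.

```python
import math


def log_likelihood(doc_words, topic_given_author, word_given_topic):
    """log p(d | a) = sum_i log sum_z p(z | a) p(w_i | z)."""
    total = 0.0
    for w in doc_words:
        total += math.log(sum(p_z * word_given_topic[z][w]
                              for z, p_z in enumerate(topic_given_author)))
    return total


def rank_authors(doc_words, author_topics, word_given_topic):
    """Return the authors sorted from most to least likely for the document."""
    scores = {a: log_likelihood(doc_words, p_z, word_given_topic)
              for a, p_z in author_topics.items()}
    return sorted(scores, key=scores.get, reverse=True)


word_given_topic = [[0.9, 0.1], [0.1, 0.9]]        # two topics, two words
author_topics = {"a": [0.9, 0.1], "b": [0.1, 0.9]}  # p(z | author)
ranking = rank_authors([0, 0, 0], author_topics, word_given_topic)
```

Recall at a given cutoff then follows by checking how many of a paper's true authors appear in the top positions of the ranking.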

We trained the models on the DBLP data set using 5-fold cross-validation and report the recall when the number of topics is set to 100 and 200. Results are shown in Figure 6(a) and Figure 6(b). TWDA ranks authors consistently higher than the other models.


7.4 Results on Feature Construction for Classification

The next experiment tests the classification performance using feature sets generated by TWDA and other baselines. For the base classifier, we use LIBSVM [13] with a Gaussian kernel and the default parameters. For comparison, we trained four SVMs on: tf-idf word features; features induced by a 30-topic LDA model together with tf-idf word features; features generated by a TWDA model with the same number of topics; and features induced by a 30-topic TWDA model together with tf-idf word features.

In these experiments, we conducted multi-class classification on the IMDB data set, which contains 29 genres. We calculated the evaluation metrics @1, @3 and @5 with the provided class tags of movie genres, using 5-fold cross-validation. We report the movie classification performance of the different methods in Figure 7, where we see a significant improvement in classification performance when using LDA and TWDA compared with using only tf-idf features; TWDA outperforms both LDA and tf-idf in terms of @1, @3 and @5.

To characterize the classification performance further, we also calculated the F-measure (F1-score). The results are reported in Table 1. TWDA provides substantially better performance on the F-measure.

7.5 Results on Model Robustness 

In this part of the experimental analysis, we evaluate the robustness of our model. We measured and compared the perplexity after adding noise tags to the test documents of the DBLP data set. We randomly added 20%, 40%, 50%, 80% and 100% noise tags to a test document and then calculated the perplexity. For example, if a paper in DBLP has five authors, adding 20% noise means that we randomly select one author from the author set of the DBLP corpus and add it to the paper as a noise author.

Fig. 7. Classification results of different features on @1, @3 and @5 with 5-fold cross-validation.
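The noise-injection step of the robustness experiment can be sketched in a few lines. The function name `add_noise_tags`, the rounding rule for fractional counts, and the exclusion of the document's own authors from the candidate pool are assumptions of this sketch; the text only states that noise authors are drawn at random from the corpus author set.

```python
import random


def add_noise_tags(doc_tags, all_tags, ratio, rng=None):
    """Return the document's tags followed by randomly chosen noise tags.

    ratio is the noise level relative to the number of original tags,
    e.g. 0.2 adds one noise tag to a document with five tags.
    """
    rng = rng or random.Random()
    n_noise = round(ratio * len(doc_tags))  # assumption: round to nearest count
    # Assumption: a noise tag must not already be a tag of the document.
    candidates = [t for t in all_tags if t not in doc_tags]
    return list(doc_tags) + rng.sample(candidates, n_noise)


authors = ["a1", "a2", "a3", "a4", "a5"]
corpus_authors = authors + ["n1", "n2", "n3", "n4"]
noisy = add_noise_tags(authors, corpus_authors, 0.2, random.Random(0))
```

Passing a seeded `random.Random` instance keeps the noise reproducible across the models being compared.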

In some real-world applications, the noise tags that appear in a document may have some relevance to the real tags. In this experiment, we therefore selected the noise tags from the author-tag set, to match real applications to some extent. Because the DBLP corpus contains more than 6,000 tags, the noise tags added to a test document are very sparse relative to the whole tag set. We therefore added different percentages of noise tags to the test documents to show the trend of perplexity as the noise level increases. Figure 8 shows that both TWDA and ATM have a steadier trend than DMR as the noise level increases. Table 2 shows some examples of the weights of the original tags and the noise tags. The tags in red are the noise added to the test data, and the values after them are the tag weights inferred by the TWDA model. Note that the weight values are shown after normalization. As the results show, TWDA is robust: the weights of the noise tags are much smaller than those of the original tags. In some applications, we can use the proposed model to rank the tags given in a document, which would be a good approach to tag recommendation and annotation.
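Ranking the tags of a document by their inferred weights can be sketched as below. The sketch assumes that the normalized weight of a tag is the expectation of its Dirichlet weight, $\xi_i / \sum_j \xi_j$, computed from the variational parameter $\xi$ of the document; the function name `rank_tags` and the example values are illustrative and are not taken from Table 2.

```python
def rank_tags(tags, xi):
    """Sort tags by normalized weight xi_i / sum_j xi_j, largest first."""
    total = sum(xi)
    weights = {tag: value / total for tag, value in zip(tags, xi)}
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)


ranked = rank_tags(["a", "b", "c"], [1.0, 3.0, 4.0])
```

Tags at the bottom of such a ranking are candidates for noise, while the top of the ranking can serve tag recommendation.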

7.6 Results on Large-scale Datasets 

We evaluate the three proposed parallelized solutions of TWTM on a large-scale dataset in terms of training time and accuracy on document modeling; the solutions are suitable for TWDA as well.

First, we measured and compared the training time of Solutions I, II and III on the Wordpress blog data set with the same system setting and model parameters. We used a doc-indexed sparse storage mode for the matrix of ^-document, since the matrix would be very large for a large-scale data set. Figure 9(a) shows the average training time



(a) K=100   (b) K=200

Fig. 8. Results of adding noise for different models (ATM, DMR and TWDA), with (a) K=100 and (b) K=200. A steady trend indicates good model robustness.


TABLE 2

Some examples of the normalized weights among the original tags and noise tags. The noise tags are in red, and the numbers are the corresponding weight values.

"Bug isolation via remote program sampling" [24]
  Ben Liblit: 0.185    Alex Aiken: 0.2257    Alice X. Zheng: 0.228    Michael I. Jordan: 0.349
  K. G. Shin: 0.01

"Web question answering: is more always better?" [18]
  Susan Dumais: 0.986    Michele Banko: 0.0032    Eric Brill: 0.0038    Jimmy Lin: 0.0038    Andrew Ng: 0.0024
  R. Katz: 0.00018

"Contextual search and name disambiguation in email using graphs" [27]
  Einat Minkov: 0.425    William W. Cohen: 0.342    Andrew Y. Ng: 0.128
  J. Ma: 0.033    D. Ferguson: 0.07

"A Sparse Sampling Algorithm for Near-Optimal Planning in Large Markov Decision Processes" [22]
  Michael Kearns: 0.296    Yishay Mansour: 0.166    Andrew Y. Ng: 0.31
  J. Blythe: 0.089    B. Adida: 0.027    P. J. Modi: 0.1

"The nested Chinese restaurant process and bayesian nonparametric inference of topic"
  David M. Blei: 0.46    Thomas L. Griffiths: 0.186    Michael I. Jordan: 0.225
  B. Clifford: 0.031    R. Szeliski: 0.048    X. Wang: 0.05


per iteration of the three solutions, with the standard TWTM as the baseline, when the number of topics is set to K = 10, 20 and 50.

Second, we sampled the training dataset from the Wordpress corpus with sample ratios of 0.1, 0.3, 0.6, 0.8 and 1.0 to show how the running time changes with the size of the training dataset. In addition, we limited the maximum number of mappers in the configuration when we trained the


Fig. 9. (a) The average training time per iteration for Solutions I, II and III with different numbers of topics, compared with the standard TWTM. (b) The perplexity results for Solutions I, II and III and the standard TWTM.

(a) K = 20   (b) K = 50

Fig. 10. The average training time per iteration on the Wordpress corpus with different sampling ratios for Solutions I, II and III.

(a) K = 20   (b) K = 50

Fig. 11. The average training time per iteration on the Wordpress corpus with different numbers of mappers for Solutions I, II and III. Note that the horizontal axis represents the maximum number of mappers used in a training task.


model, as described in Section 7.1, to compare the performance of the three solutions under restricted resources. Figure 10 and Figure 11 show the average training time per iteration of the three solutions for different sample ratios and numbers of mappers, with the number of topics set to K = 10, 20 and 50. From this part of the experiments, we find that Solution II is more efficient than Solutions I and III.

Meanwhile, to compare with another model, PLDA, we used the Wordpress dataset with 1,000 tags to train a PLDA model with Ki = 1 (we used the code from the Stanford Topic Modeling Toolbox). We trained TWTM by Solution II with K = 5. Table 3 shows the comparison of PLDA and TWTM by Solution II.


TABLE 3

The average training time (seconds) per iteration for Solution II and PLDA

Sampling ratio    0.1     0.3      0.6      0.8      1.0
PLDA              66.6    114.8    193.4    250.6    276.4
Solution II       77.6    88.6     104.8    116.2    120.8


As described in Section 6, Solution I spends a great deal of time on data migration to update $\pi$ in the driver process, and Solution III spends a lot of resources on the clustering process in each mapper, especially when the corpus is non-homogeneous, which leads to uneven loading of the mappers. Solution II avoids these problems by using an approximation method.

Lastly, we measured the generalization capability of the three solutions using perplexity. We held out 20% of the data for testing and trained the three solutions on the remaining 80%. As shown in Figure 9(b), there is relatively little difference in perplexity among the solutions and the standard TWTM as the number of topics increases. That is, all three solutions are good approximations in terms of model fitness. It is worth noting that Solution II has almost the same performance as Solution I and Solution III.


8 Conclusion 

With the tag-weighted topic model proposed in this paper, we provide and analyze a probabilistic approach for mining semi-structured documents. Three distributed solutions for TWTM are presented to handle large-scale problems. In addition, TWTM is able to obtain the topic distributions of the tags in the corpus, which is very useful for text classification, clustering and other data mining applications. We also propose a novel, highly extensible framework for processing tagged text, which uses a novel function for the tag-weighted topic assignment of documents. As an extended model, TWDA is capable of handling mixed corpora of semi-structured and unstructured documents. A second benefit of the tag-weighted topic model is that it allows one to incorporate different types of tags in modeling documents, and provides a general framework for multi-tag modeling at the level of both tags and documents. It offers a different approach to classification, clustering, recommendation, and so on. For large-scale semi-structured documents, the proposed solutions are shown to be effective and efficient for some complex web applications. In the future, we plan to apply TWTM to different practical areas (e.g., image classification and annotation, video retrieval).
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In the topic models, the key inferential problem that we need to solve is to compute the posterior distribution 
of the hidden variables given a document d. Given the document d, we can easily get the posterior distribution 
of the latent variables in the proposed model, as: 


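$$p(\varepsilon^d, z \mid w^d, T^d, \theta, \eta, \psi, \pi) = \frac{p(\varepsilon^d, z, w^d, T^d \mid \theta, \eta, \psi, \pi)}{p(w^d, T^d \mid \theta, \eta, \psi, \pi)}.$$

Integrating over $\varepsilon$ and summing out $z$, we obtain the marginal distribution of $d$:

$$p(w^d, T^d \mid \eta, \theta, \psi, \pi) = p(t^d \mid \eta) \int p(\varepsilon^d \mid T^d \times \pi) \prod_{i=1}^{N} \sum_{z_i^d=1}^{K} p(z_i^d \mid (\varepsilon^d)^T \times T^d \times \theta)\, p(w_i^d \mid z_i^d, \psi_{1:K})\, d\varepsilon^d.$$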

In this work, we use a mean-field variational EM algorithm to efficiently obtain an approximation of this posterior distribution of the latent variables. In mean-field variational inference, we minimize the KL divergence between the variational posterior and the true posterior by maximizing the evidence lower bound (ELBO) $\mathcal{L}(\cdot)$. For a single document $d$, we obtain $\mathcal{L}(\cdot)$ using Jensen's inequality:


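$$\begin{aligned}
\mathcal{L}(\xi_{1:l^d}, \gamma_{1:K}; \eta_{1:L}, \pi_{1:L}, \theta_{1:L}, \psi_{1:K}) ={}& E[\log p(T_{1:l^d} \mid \eta_{1:L})] + E[\log p(\varepsilon^d \mid T^d \times \pi)] \\
&+ \sum_{i=1}^{N} E[\log p(z_i \mid (\varepsilon^d)^T \times T^d \times \theta)] + \sum_{i=1}^{N} E[\log p(w_i \mid z_i, \psi_{1:K})] + H(q),
\end{aligned}$$

where $\xi$ is an $l^d$-dimensional Dirichlet parameter vector and $\gamma$ is a $1 \times K$ vector, both of which are variational parameters of the variational distribution. $H(q)$ denotes the entropy of the variational distribution:

$$H(q) = -E[\log q(\varepsilon^d)] - E[\log q(z)].$$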

Here the expectation is taken with respect to a variational distribution $q(\varepsilon^d, z_{1:N})$, and we choose the following fully factorized distribution:

$$q(\varepsilon^d, z_{1:N} \mid \xi_{1:L}, \gamma_{1:K}) = q(\varepsilon^d \mid \xi) \prod_{i=1}^{N} q(z_i \mid \gamma_i),$$

where the dimension of the parameter $\xi$ changes with the document.

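In $\mathcal{L}(\cdot)$,

$$E[\log p(z_i \mid (\varepsilon^d)^T \times T^d \times \theta)] = \sum_{k=1}^{K} \gamma_{ik} E[\log((\varepsilon^d)^T \times T^d \times \theta)_k].$$

To preserve the lower bound on the log probability, we lower bound $E[\log((\varepsilon^d)^T \times T^d \times \theta)_k]$ using Jensen's inequality again:

$$E[\log((\varepsilon^d)^T \times T^d \times \theta)_k] = E\Big[\log \sum_{i=1}^{l^d} \varepsilon_i^d \theta_k^{(i)}\Big] \geq E\Big[\sum_{i=1}^{l^d} \varepsilon_i^d \log \theta_k^{(i)}\Big] = \sum_{i=1}^{l^d} \log \theta_k^{(i)} E[\varepsilon_i^d],$$

where $\theta^{(i)}$, $i \in \{1, \ldots, l^d\}$, denotes the $i$-th tag's topic assignment vector, corresponding to the $i$-th row of $\Theta^d$. Note that the expectation of a Dirichlet random variable is $E[\varepsilon_i^d] = \xi_i / \sum_{j=1}^{l^d} \xi_j$.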


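Thus, for the document $d$,

$$\sum_{i=1}^{N} E[\log p(z_i \mid (\varepsilon^d)^T \times T^d \times \theta)] = \sum_{i=1}^{N} \sum_{k=1}^{K} \gamma_{ik} \sum_{j=1}^{l^d} \log \theta_k^{(j)} \frac{\xi_j}{\sum_{j'=1}^{l^d} \xi_{j'}}.$$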
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Finally, we expand £(•) in terms of the model parameters (//. tt, 0. ip) and the variational parameters (£, 7 ) 
as follows: 


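$$\begin{aligned}
\mathcal{L}(\xi, \gamma; \eta, \pi, \theta, \psi) ={}& \sum_{l=1}^{L} \Big( t_l^d \log \eta_l + (1 - t_l^d) \log(1 - \eta_l) \Big) \\
&+ \log \Gamma\Big( \sum_{i=1}^{l^d} (T^d \times \pi)_i \Big) - \sum_{i=1}^{l^d} \log \Gamma\big( (T^d \times \pi)_i \big) + \sum_{i=1}^{l^d} \big( (T^d \times \pi)_i - 1 \big) \Big( \Psi(\xi_i) - \Psi\Big( \sum_{j=1}^{l^d} \xi_j \Big) \Big) \\
&+ \sum_{i=1}^{N} \sum_{k=1}^{K} \gamma_{ik} \sum_{j=1}^{l^d} \log \theta_k^{(j)} \frac{\xi_j}{\sum_{j'=1}^{l^d} \xi_{j'}} + \sum_{i=1}^{N} \sum_{k=1}^{K} \sum_{j=1}^{V} \gamma_{ik} (w^d)_i^j \log \psi_{kj} \\
&- \log \Gamma\Big( \sum_{i=1}^{l^d} \xi_i \Big) + \sum_{i=1}^{l^d} \log \Gamma(\xi_i) - \sum_{i=1}^{l^d} (\xi_i - 1) \Big( \Psi(\xi_i) - \Psi\Big( \sum_{j'=1}^{l^d} \xi_{j'} \Big) \Big) - \sum_{i=1}^{N} \sum_{k=1}^{K} \gamma_{ik} \log \gamma_{ik}.
\end{aligned}$$

Then, we maximize the lower bound $\mathcal{L}(\xi, \gamma; \eta, \pi, \theta, \psi)$ with respect to the variational parameters $\xi$ and $\gamma$, using a variational expectation-maximization (EM) procedure as follows.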

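A.1 Variational E-step

A.1.1 $\xi$

We first maximize $\mathcal{L}(\cdot)$ with respect to $\xi$ for the document $d$. The terms that contain $\xi$ are

$$\begin{aligned}
\mathcal{L}_{[\xi]} ={}& \sum_{i=1}^{l^d} \Big( \sum_{i'=1}^{L} T^d_{ii'} \pi_{i'} - 1 \Big) \Big( \Psi(\xi_i) - \Psi\Big( \sum_{j'=1}^{l^d} \xi_{j'} \Big) \Big) + \sum_{i=1}^{N} \sum_{k=1}^{K} \gamma_{ik} \sum_{j=1}^{l^d} \log \theta_k^{(j)} \frac{\xi_j}{\sum_{j'=1}^{l^d} \xi_{j'}} \\
&- \log \Gamma\Big( \sum_{i=1}^{l^d} \xi_i \Big) + \sum_{i=1}^{l^d} \log \Gamma(\xi_i) - \sum_{i=1}^{l^d} (\xi_i - 1) \Big( \Psi(\xi_i) - \Psi\Big( \sum_{i'=1}^{l^d} \xi_{i'} \Big) \Big),
\end{aligned}$$

where $\Psi(\cdot)$ denotes the digamma function, the first derivative of the log of the Gamma function, and $(T^d \times \pi)_i = \sum_{i'=1}^{L} T^d_{ii'} \pi_{i'}$. The derivative of $\mathcal{L}_{[\xi]}$ with respect to $\xi_l$ is

$$\mathcal{L}'_{[\xi]}(\xi_l) = \Psi'(\xi_l) \Big( \sum_{i'=1}^{L} T^d_{li'} \pi_{i'} - \xi_l \Big) - \Psi'\Big( \sum_{j=1}^{l^d} \xi_j \Big) \sum_{i=1}^{l^d} \Big( \sum_{i'=1}^{L} T^d_{ii'} \pi_{i'} - \xi_i \Big) + \sum_{i=1}^{N} \sum_{k=1}^{K} \gamma_{ik} \frac{\log \theta_k^{(l)} \sum_{j=1}^{l^d} \xi_j - \sum_{j=1}^{l^d} \log \theta_k^{(j)} \xi_j}{\big( \sum_{j'=1}^{l^d} \xi_{j'} \big)^2}.$$

Here we use a gradient method to find the $\xi$ that maximizes $\mathcal{L}_{[\xi]}$.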

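A.1.2 $\gamma$

Next, we maximize $\mathcal{L}(\cdot)$ with respect to $\gamma_{ik}$. Adding Lagrange multipliers to the terms that contain $\gamma_{ik}$, we get:

$$\mathcal{L}_{[\gamma]} = \sum_{i=1}^{N} \sum_{k=1}^{K} \gamma_{ik} \sum_{j=1}^{l^d} \log \theta_k^{(j)} \frac{\xi_j}{\sum_{j'=1}^{l^d} \xi_{j'}} + \sum_{i=1}^{N} \sum_{k=1}^{K} \sum_{j=1}^{V} \gamma_{ik} (w^d)_i^j \log \psi_{kj} - \sum_{i=1}^{N} \sum_{k=1}^{K} \gamma_{ik} \log \gamma_{ik} + \sum_{i=1}^{N} \lambda_i \Big( \sum_{k=1}^{K} \gamma_{ik} - 1 \Big).$$

Setting the derivative with respect to $\gamma_{ik}$ to zero yields the update equation for $\gamma_{ik}$:

$$\gamma_{ik} \propto \psi_{k, v^{w_i}} \exp\Big\{ \sum_{j=1}^{l^d} \log \theta_k^{(j)} \frac{\xi_j}{\sum_{j'=1}^{l^d} \xi_{j'}} \Big\},$$

where $v^{w_i}$ denotes the index of $w_i$ in the dictionary.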
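The update for one word's topic responsibilities can be sketched as follows, assuming the proportionality $\gamma_{ik} \propto \psi_{k, v^{w_i}} \exp\{\sum_j \log \theta_k^{(j)}\, \xi_j / \sum_{j'} \xi_{j'}\}$. The function name `update_gamma_row` and the nested-list inputs are hypothetical; the sketch normalizes in log space (subtracting the maximum score before exponentiating) purely for numerical stability.

```python
import math


def update_gamma_row(word_id, xi, theta_rows, psi):
    """Return gamma_i (length K) for a single word of a document.

    xi         : variational Dirichlet parameters, one per tag of the document
    theta_rows : topic distributions of the document's tags (rows of Theta^d)
    psi        : K x V topic-word matrix
    """
    xi_sum = sum(xi)
    log_scores = []
    for k in range(len(psi)):
        # Expected log tag-topic term: sum_j log(theta_k^(j)) * E[eps_j]
        expected_log_theta = sum(math.log(theta_rows[j][k]) * xi[j] / xi_sum
                                 for j in range(len(xi)))
        log_scores.append(math.log(psi[k][word_id]) + expected_log_theta)
    top = max(log_scores)
    unnormalized = [math.exp(s - top) for s in log_scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


gamma_i = update_gamma_row(0, [2.0], [[0.5, 0.5]], [[0.8, 0.2], [0.2, 0.8]])
```

When the single tag is indifferent between the two topics, the responsibilities are driven by the word probabilities alone, so `gamma_i` is close to `[0.8, 0.2]`.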


A.2 M-step estimation

The M-step needs to update four parameters: $\eta$, the tagging prior probability; $\pi$, the Dirichlet prior of the tags' weights; $\theta$, the topic distribution over all tags in the corpus; and $\psi$, the probability of a word under a topic.
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A.2.1 t] 

For a given corpus, $\eta_i$ is estimated by counting the documents in which the $i$-th tag appears. It
does not depend on any other parameter in the proposed model. By maximizing the terms which contain
$\eta$, we have

$$
\eta_i = \frac{\sum_{d=1}^{D} t^d_i}{D},
$$

where $D$ is the size of the corpus. Because each document's tag set is observed, the Bernoulli prior $\eta$ is not used during inference;
it is included only for model completeness.
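As a small illustration of this estimate, the sketch below computes the fraction of documents that carry each tag. The function name `estimate_eta` and the representation of a document's tags as a list of integer tag indices are assumptions made for this example only.

```python
def estimate_eta(doc_tags, num_tags):
    """Fraction of documents in which each tag appears."""
    D = len(doc_tags)
    counts = [0] * num_tags
    for tags in doc_tags:
        for l in set(tags):  # a tag counts once per document
            counts[l] += 1
    return [c / D for c in counts]
```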


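A.2.2 $\pi$

For the document $d$, the terms that involve the Dirichlet prior $\pi$ are:

$$
\mathcal{L}_{[\pi]} = \log\Gamma\Big(\sum_{i=1}^{l^d}(T^d \times \pi)_i\Big) - \sum_{i=1}^{l^d}\log\Gamma\big((T^d \times \pi)_i\big)
 + \sum_{i=1}^{l^d}\big((T^d \times \pi)_i - 1\big)\Big(\Psi(\xi_i) - \Psi\Big(\sum_{j=1}^{l^d}\xi_j\Big)\Big).
$$

We estimate $\pi$ with a gradient descent method over the whole corpus. Taking the derivative of $\mathcal{L}_{[\pi]}$ with respect to $\pi_l$
and summing over all documents, we obtain:

$$
\frac{\partial\mathcal{L}_{[\pi]}}{\partial\pi_l} =
 \sum_{d=1}^{D}\Psi\Big(\sum_{i=1}^{l^d}\sum_{l'=1}^{L}\pi_{l'}T^d_{il'}\Big)\cdot\sum_{i=1}^{l^d}T^d_{il}
 - \sum_{d=1}^{D}\sum_{i=1}^{l^d}\Psi\Big(\sum_{l'=1}^{L}\pi_{l'}T^d_{il'}\Big)\,T^d_{il}
 + \sum_{d=1}^{D}\sum_{i=1}^{l^d}\Big(\Psi(\xi^d_i) - \Psi\Big(\sum_{j=1}^{l^d}\xi^d_j\Big)\Big)\,T^d_{il}.
$$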


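A.2.3 $\theta$

The only term that involves $\theta$ is:

$$
\mathcal{L}_{[\theta]} = \sum_{d=1}^{D}\sum_{i=1}^{N}\sum_{k=1}^{K}\gamma^d_{ik}\sum_{j=1}^{l^d}\log\theta^{(j)}_k\,\frac{\xi^d_j}{\sum_{j'=1}^{l^d}\xi^d_{j'}},
$$

where $\xi_j$, $j \in \{1, \dots, l^d\}$, in the document $d$ needs to be extended to $t^d_l \cdot \xi^d_l$, $l \in \{1, \dots, L\}$, for convenience in
simplifying $\mathcal{L}_{[\theta]}$. We form the Lagrangian of $\mathcal{L}_{[\theta]}$ by adding $\sum_{l=1}^{L}\lambda_l\big(\sum_{k=1}^{K}\theta_{lk} - 1\big)$, which incorporates the constraint that the $K$ components of $\theta_l$
sum to one. Taking the derivative with respect to $\theta_{lk}$ and setting it to zero, we obtain the estimate of $\theta$ over the whole corpus:

$$
\theta_{lk} \propto \sum_{d=1}^{D}\sum_{i=1}^{N}\gamma^d_{ik}\,\frac{t^d_l\,\xi^d_l}{\sum_{l'=1}^{L} t^d_{l'}\,\xi^d_{l'}}.
$$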

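A.2.4 $\psi$

To maximize with respect to $\psi$, we isolate the corresponding terms and add Lagrange multipliers:

$$
\mathcal{L}_{[\psi]} = \sum_{d=1}^{D}\sum_{i=1}^{N}\sum_{k=1}^{K}\sum_{j=1}^{V}\gamma^d_{ik}\,(w^d)^j_i\log\psi_{kj}
 + \sum_{k=1}^{K}\lambda_k\Big(\sum_{j=1}^{V}\psi_{kj} - 1\Big).
$$

Taking the derivative with respect to $\psi_{kj}$ and setting it to zero, we get:

$$
\psi_{kj} \propto \sum_{d=1}^{D}\sum_{i=1}^{N}\gamma^d_{ik}\,(w^d)^j_i.
$$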
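The update for $\psi$ amounts to accumulating expected word-topic counts and normalizing each topic's row. A minimal sketch, assuming each document is a pair `(word_ids, gamma)` with one responsibility row per word; the name `update_psi` and the `smoothing` constant (which avoids division by zero for a topic with no mass) are assumptions of this example.

```python
def update_psi(docs, num_topics, vocab_size, smoothing=1e-12):
    """docs: iterable of (word_ids, gamma); returns a K x V row-stochastic matrix."""
    psi = [[smoothing] * vocab_size for _ in range(num_topics)]
    for word_ids, gamma in docs:
        for v, row in zip(word_ids, gamma):
            for k in range(num_topics):
                psi[k][v] += row[k]  # expected count of word v under topic k
    return [[x / sum(r) for x in r] for r in psi]
```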
Appendix B

Tag-Weighted Dirichlet Allocation 


In TWDA, we treat $\pi$, $\eta$, $\mu$, $\theta$ and $\psi$ as unknown constants to be estimated, and use a variational expectation-maximization
(EM) procedure to carry out approximate maximum likelihood estimation, as in TWTM. Given the
document $d$, the posterior distribution of the latent variables in the TWDA model is:

$$
p(\varepsilon^d, \lambda^d, z \mid w^d, T^d, \theta, \eta, \psi, \pi, \mu) =
 \frac{p(\varepsilon^d, \lambda^d, z, w^d, T^d \mid \theta, \eta, \psi, \pi, \mu)}{p(w^d, T^d \mid \theta, \eta, \psi, \pi, \mu)}.
$$


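As with TWTM, it is not efficiently computable. We maximize the evidence lower bound (ELBO) $\mathcal{L}(\cdot)$ obtained using
Jensen's inequality; for a document $d$ it has the form:

$$
\begin{aligned}
\mathcal{L}(\xi_{1:L}, \rho_{1:K}, \gamma_{1:K}; \eta_{1:L}, \pi_{1:L}, \mu_{1:K}, \theta_{1:L}, \psi_{1:K})
 ={}& E[\log p(T^d_{1:l^d} \mid \eta_{1:L})] + E[\log p(\varepsilon^d \mid T^d \times \pi)] + E[\log p(\lambda^d \mid \mu)] \\
& + \sum_{i=1}^{N} E[\log p(z_i \mid (\varepsilon^d)^T \times T^d \times \Theta)] + \sum_{i=1}^{N} E[\log p(w_i \mid z_i, \psi_{1:K})] \\
& + H(q),
\end{aligned}
$$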

where $\xi$ is an $l^d$-dimensional Dirichlet parameter vector, $\rho$ is a $1 \times K$ vector and $\gamma$ is a $1 \times K$ vector, all of which
are variational parameters of the variational distribution. Unlike in TWTM, $l^d$ in TWDA is one more than the
number of the observed tags in the document $d$. $H(q)$ indicates the entropy of the variational distribution:

$$
H(q) = -E[\log q(\varepsilon^d)] - E[\log q(\lambda^d)] - E[\log q(z)].
$$

Here the expectation is taken with respect to a variational distribution $q(\varepsilon^d, \lambda^d, z_{1:N})$, and we choose the
following fully factorized distribution:

$$
q(\varepsilon^d, \lambda^d, z_{1:N} \mid \xi_{1:L}, \rho_{1:K}, \gamma_{1:K}) = q(\varepsilon^d \mid \xi)\, q(\lambda^d \mid \rho) \prod_{i=1}^{N} q(z_i \mid \gamma_i).
$$

The expected log probability of a topic assignment is

$$
E[\log p(z_i \mid (\varepsilon^d)^T \times T^d \times \Theta)] = \sum_{k=1}^{K}\gamma_{ik}\, E\big[\log\big((\varepsilon^d)^T \times T^d \times \Theta\big)_k\big],
$$

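which is difficult to compute because of the tag-weighted topic assignment used in TWDA. Thus
we use Jensen's inequality:

$$
\begin{aligned}
E\big[\log\big((\varepsilon^d)^T \times T^d \times \Theta\big)_k\big]
 &= E\Big[\log\Big(\sum_{i=1}^{l^d-1}\varepsilon^d_i\,\theta^{(i)}_k + \varepsilon^d_{l^d}\,\lambda_k\Big)\Big] \\
 &\geq E\Big[\sum_{i=1}^{l^d-1}\varepsilon^d_i\log\theta^{(i)}_k + \varepsilon^d_{l^d}\cdot\log\lambda_k\Big] \\
 &= \sum_{i=1}^{l^d-1}\log\theta^{(i)}_k\, E[\varepsilon^d_i] + E[\varepsilon^d_{l^d}\cdot\log\lambda_k],
\end{aligned}
$$

where $\theta^{(i)}$, $i \in \{1, \dots, l^d - 1\}$, denotes the $i$-th tag's topic assignment vector, corresponding
to the $i$-th row of $\Theta^d$.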
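The Jensen step can be illustrated numerically for the tag part of the mixture. The sketch below is only an illustration under stated assumptions: it draws Dirichlet samples of the weights (via normalized Gamma draws from the standard library), estimates $E[\log\sum_i \varepsilon_i\theta^{(i)}_k]$ by Monte Carlo for a single topic $k$, and compares it with the closed-form lower bound $\sum_i E[\varepsilon_i]\log\theta^{(i)}_k$. The function name `jensen_gap`, the sample count and the seed are hypothetical choices.

```python
import math
import random

def jensen_gap(xi, theta_k, samples=2000, seed=0):
    """Return (Monte Carlo estimate of E[log sum_i eps_i*theta_i], Jensen lower bound)."""
    rng = random.Random(seed)
    s = sum(xi)
    # Closed-form bound: sum_i E[eps_i] * log(theta_i), with E[eps_i] = xi_i / sum(xi).
    lower = sum(x / s * math.log(t) for x, t in zip(xi, theta_k))
    total = 0.0
    for _ in range(samples):
        g = [rng.gammavariate(x, 1.0) for x in xi]  # normalized Gammas ~ Dirichlet(xi)
        z = sum(g)
        total += math.log(sum(gi / z * t for gi, t in zip(g, theta_k)))
    return total / samples, lower
```

The gap between the two returned values shows how loose the bound is for a given $\xi$; it shrinks as the Dirichlet concentrates on a single component.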

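Because the variational distribution is fully factorized, we get:

$$
E\big[\log\big((\varepsilon^d)^T \times T^d \times \Theta\big)_k\big] \geq \sum_{i=1}^{l^d-1}\log\theta^{(i)}_k\, E[\varepsilon^d_i] + E[\varepsilon^d_{l^d}]\cdot E[\log\lambda_k],
$$

where

$$
E[\varepsilon^d_i] = \frac{\xi_i}{\sum_{j=1}^{l^d}\xi_j}, \qquad
E[\log\lambda_k] = \Psi(\rho_k) - \Psi\Big(\sum_{j=1}^{K}\rho_j\Big).
$$

Thus, for the document $d$,

$$
\sum_{i=1}^{N} E[\log p(z_i \mid (\varepsilon^d)^T \times T^d \times \Theta)] \geq
 \sum_{i=1}^{N}\sum_{k=1}^{K}\gamma_{ik}\Big[\sum_{j=1}^{l^d-1}\log\theta^{(j)}_k\,\frac{\xi_j}{\sum_{j'=1}^{l^d}\xi_{j'}}
 + \Big(\Psi(\rho_k) - \Psi\Big(\sum_{j=1}^{K}\rho_j\Big)\Big)\frac{\xi_{l^d}}{\sum_{j'=1}^{l^d}\xi_{j'}}\Big].
$$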


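Then we expand the $\mathcal{L}(\cdot)$ of TWDA as follows:

$$
\begin{aligned}
\mathcal{L}(\xi, \gamma, \rho; \eta, \pi, \mu, \theta, \psi) ={}& \sum_{l=1}^{L}\big(t^d_l\log\eta_l + (1 - t^d_l)\log(1 - \eta_l)\big) \\
& + \log\Gamma\Big(\sum_{i=1}^{l^d}(T^d \times \pi)_i\Big) - \sum_{i=1}^{l^d}\log\Gamma\big((T^d \times \pi)_i\big)
 + \sum_{i=1}^{l^d}\big((T^d \times \pi)_i - 1\big)\Big(\Psi(\xi_i) - \Psi\Big(\sum_{j=1}^{l^d}\xi_j\Big)\Big) \\
& + \log\Gamma\Big(\sum_{j=1}^{K}\mu_j\Big) - \sum_{i=1}^{K}\log\Gamma(\mu_i)
 + \sum_{i=1}^{K}(\mu_i - 1)\Big(\Psi(\rho_i) - \Psi\Big(\sum_{j=1}^{K}\rho_j\Big)\Big) \\
& + \sum_{i=1}^{N}\sum_{k=1}^{K}\gamma_{ik}\sum_{j=1}^{l^d} c^{(j)}_k\,\frac{\xi_j}{\sum_{j'=1}^{l^d}\xi_{j'}}
 + \sum_{i=1}^{N}\sum_{k=1}^{K}\sum_{j=1}^{V}\gamma_{ik}\,(w^d)^j_i\log\psi_{kj} \\
& - \log\Gamma\Big(\sum_{i=1}^{l^d}\xi_i\Big) + \sum_{i=1}^{l^d}\log\Gamma(\xi_i)
 - \sum_{i=1}^{l^d}(\xi_i - 1)\Big(\Psi(\xi_i) - \Psi\Big(\sum_{j=1}^{l^d}\xi_j\Big)\Big) \\
& - \sum_{i=1}^{N}\sum_{k=1}^{K}\gamma_{ik}\log\gamma_{ik} \\
& - \log\Gamma\Big(\sum_{j=1}^{K}\rho_j\Big) + \sum_{i=1}^{K}\log\Gamma(\rho_i)
 - \sum_{i=1}^{K}(\rho_i - 1)\Big(\Psi(\rho_i) - \Psi\Big(\sum_{j=1}^{K}\rho_j\Big)\Big),
\end{aligned}
$$

where

$$
c^{(j)}_k =
\begin{cases}
\log\theta^{(j)}_k, & j \in \{1, \dots, l^d - 1\}, \\
\Psi(\rho_k) - \Psi\big(\sum_{j'=1}^{K}\rho_{j'}\big), & j = l^d,
\end{cases}
$$

and

$$
(T^d \times \pi)_i = \sum_{l=1}^{L+1}\pi_l T^d_{il}.
$$


B.1 Variational E-step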

B.1 Variational E-step 

For a single document $d$, the variational parameters include $\xi^d$, $\rho^d$ and $\gamma^d$. First, we maximize $\mathcal{L}(\cdot)$ with
respect to the variational parameters to obtain an estimate of the posterior.


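B.1.1 Optimization with respect to $\xi$

We first maximize $\mathcal{L}(\cdot)$ with respect to $\xi$ for the document $d$. The terms that contain $\xi$ are:

$$
\begin{aligned}
\mathcal{L}_{[\xi]} ={}& \sum_{i=1}^{l^d}\Big(\sum_{l=1}^{L+1}\pi_l T^d_{il} - 1\Big)\Big(\Psi(\xi_i) - \Psi\Big(\sum_{j'=1}^{l^d}\xi_{j'}\Big)\Big)
 + \sum_{i=1}^{N}\sum_{k=1}^{K}\gamma_{ik}\sum_{j=1}^{l^d} c^{(j)}_k\,\frac{\xi_j}{\sum_{j'=1}^{l^d}\xi_{j'}} \\
& - \log\Gamma\Big(\sum_{i=1}^{l^d}\xi_i\Big) + \sum_{i=1}^{l^d}\log\Gamma(\xi_i)
 - \sum_{i=1}^{l^d}(\xi_i - 1)\Big(\Psi(\xi_i) - \Psi\Big(\sum_{j'=1}^{l^d}\xi_{j'}\Big)\Big),
\end{aligned}
$$

where $c^{(j)}_k = \log\theta^{(j)}_k$ for $j < l^d$ and $c^{(l^d)}_k = \Psi(\rho_k) - \Psi\big(\sum_{j'=1}^{K}\rho_{j'}\big)$.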

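The derivative of $\mathcal{L}_{[\xi]}$ with respect to $\xi_i$ is

$$
\mathcal{L}'_{[\xi_i]} = \Psi'(\xi_i)\Big(\sum_{l=1}^{L+1}\pi_l T^d_{il} - \xi_i\Big)
 - \Psi'\Big(\sum_{j=1}^{l^d}\xi_j\Big)\sum_{j=1}^{l^d}\Big(\sum_{l=1}^{L+1}\pi_l T^d_{jl} - \xi_j\Big)
 + \sum_{i'=1}^{N}\sum_{k=1}^{K}\gamma_{i'k}\,
 \frac{c^{(i)}_k\sum_{j=1}^{l^d}\xi_j - \sum_{j=1}^{l^d}c^{(j)}_k\,\xi_j}{\big(\sum_{j=1}^{l^d}\xi_j\big)^2}.
$$

Here we use a gradient descent method (applied to $-\mathcal{L}_{[\xi]}$) to find the $\xi$ that maximizes $\mathcal{L}_{[\xi]}$.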


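B.1.2 Optimization with respect to $\rho$

Next, we maximize $\mathcal{L}(\cdot)$ with respect to $\rho$. The terms that involve the variational Dirichlet parameter $\rho$ are:

$$
\begin{aligned}
\mathcal{L}_{[\rho]} ={}& \sum_{i=1}^{K}(\mu_i - 1)\Big(\Psi(\rho_i) - \Psi\Big(\sum_{j=1}^{K}\rho_j\Big)\Big)
 - \log\Gamma\Big(\sum_{j=1}^{K}\rho_j\Big) + \sum_{i=1}^{K}\log\Gamma(\rho_i)
 - \sum_{i=1}^{K}(\rho_i - 1)\Big(\Psi(\rho_i) - \Psi\Big(\sum_{j=1}^{K}\rho_j\Big)\Big) \\
& + \sum_{k=1}^{K}\sum_{i=1}^{N}\gamma_{ik}\,\frac{\xi_{l^d}}{\sum_{j=1}^{l^d}\xi_j}\Big(\Psi(\rho_k) - \Psi\Big(\sum_{j=1}^{K}\rho_j\Big)\Big).
\end{aligned}
$$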


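This simplifies to:

$$
\mathcal{L}_{[\rho]} = \sum_{i=1}^{K}\Big(\Psi(\rho_i) - \Psi\Big(\sum_{j=1}^{K}\rho_j\Big)\Big)\Big(\mu_i - \rho_i + \sum_{n=1}^{N}\gamma_{ni}\,\frac{\xi_{l^d}}{\sum_{j=1}^{l^d}\xi_j}\Big)
 - \log\Gamma\Big(\sum_{j=1}^{K}\rho_j\Big) + \sum_{i=1}^{K}\log\Gamma(\rho_i).
$$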


Taking the derivative with respect to $\rho_i$ and setting it to zero, we obtain a maximum at:

$$
\rho_i = \mu_i + \sum_{n=1}^{N}\gamma_{ni}\,\frac{\xi_{l^d}}{\sum_{j=1}^{l^d}\xi_j}.
$$
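The closed-form update for $\rho$ adds to the prior $\mu$ the topic responsibilities of the document, scaled by the expected weight of the extra latent row (the last component of $\xi$). A minimal sketch, assuming `xi` stores that latent component last and `gamma` is an $N \times K$ list of lists; the function name `update_rho` is hypothetical.

```python
def update_rho(mu, gamma, xi):
    """Return the updated variational Dirichlet parameter rho (length K)."""
    latent_weight = xi[-1] / sum(xi)  # expected weight of the latent (last) row
    K = len(mu)
    return [mu[k] + latent_weight * sum(row[k] for row in gamma) for k in range(K)]
```

When the latent weight is close to zero the update returns approximately $\mu$, so the observed tags dominate the topic mixture; when it is close to one the update resembles the LDA update of the variational Dirichlet parameter.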


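B.1.3 Optimization with respect to $\gamma$

The terms that contain $\gamma$ are:

$$
\mathcal{L}_{[\gamma]} = \sum_{i=1}^{N}\sum_{k=1}^{K}\gamma_{ik}\sum_{j=1}^{l^d}c^{(j)}_k\,\frac{\xi_j}{\sum_{j'=1}^{l^d}\xi_{j'}}
 + \sum_{i=1}^{N}\sum_{k=1}^{K}\sum_{j=1}^{V}\gamma_{ik}\,(w^d)^j_i\log\psi_{kj}
 - \sum_{i=1}^{N}\sum_{k=1}^{K}\gamma_{ik}\log\gamma_{ik}.
$$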


Adding the Lagrange multipliers to the terms which contain $\gamma_{ik}$, taking the derivative with respect to $\gamma_{ik}$,
and setting the derivative to zero, we obtain the update equation of $\gamma_{ik}$:

$$
\gamma_{ik} \propto \psi_{k, v^{w_i}}\exp\Big\{\sum_{j=1}^{l^d}c^{(j)}_k\,\frac{\xi_j}{\sum_{j'=1}^{l^d}\xi_{j'}}\Big\},
$$

where $v^{w_i}$ denotes the index of $w_i$ in the dictionary.

In the E-step, we update $\xi$, $\rho$ and $\gamma$ for each document with the initialized model parameters.


B.2 M-step estimation 

The M-step needs to update five parameters: $\eta$, the tagging prior probability; $\pi$, the Dirichlet prior of the tags'
weights; $\theta$, the topic distribution over all tags in the corpus; $\psi$, the probability of a word under a topic; and
$\mu$, a Dirichlet prior of the model. Note that we update $\eta$ with the same method as in TWTM.

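B.2.1 Optimization with respect to $\pi$

For the document $d$, the terms that involve the Dirichlet prior $\pi$ are:

$$
\mathcal{L}_{[\pi]} = \log\Gamma\Big(\sum_{i=1}^{l^d}(T^d \times \pi)_i\Big) - \sum_{i=1}^{l^d}\log\Gamma\big((T^d \times \pi)_i\big)
 + \sum_{i=1}^{l^d}\big((T^d \times \pi)_i - 1\big)\Big(\Psi(\xi_i) - \Psi\Big(\sum_{j=1}^{l^d}\xi_j\Big)\Big),
$$

where $(T^d \times \pi)_i = \sum_{l=1}^{L+1}\pi_l T^d_{il}$. We estimate $\pi$ with a gradient descent method over the whole corpus.
Taking the derivative of $\mathcal{L}_{[\pi]}$ with respect to $\pi_l$ and summing over all documents, we
obtain:

$$
\frac{\partial\mathcal{L}_{[\pi]}}{\partial\pi_l} =
 \sum_{d=1}^{D}\Psi\Big(\sum_{i=1}^{l^d}\sum_{l'=1}^{L+1}\pi_{l'}T^d_{il'}\Big)\cdot\sum_{i=1}^{l^d}T^d_{il}
 - \sum_{d=1}^{D}\sum_{i=1}^{l^d}\Psi\Big(\sum_{l'=1}^{L+1}\pi_{l'}T^d_{il'}\Big)\,T^d_{il}
 + \sum_{d=1}^{D}\sum_{i=1}^{l^d}\Big(\Psi(\xi^d_i) - \Psi\Big(\sum_{j=1}^{l^d}\xi^d_j\Big)\Big)\,T^d_{il}.
$$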



B.2.2 Optimization with respect to $\theta$

The only term that involves $\theta$ is:

$$
\mathcal{L}_{[\theta]} = \sum_{d=1}^{D}\sum_{i=1}^{N}\sum_{k=1}^{K}\gamma^d_{ik}\sum_{j=1}^{l^d-1}\log\theta^{(j)}_k\,\frac{\xi^d_j}{\sum_{j'=1}^{l^d}\xi^d_{j'}},
$$

where $\xi_j$, $j \in \{1, \dots, l^d\}$, in the document $d$ needs to be extended to $t^d_l \cdot \xi^d_l$, $l \in \{1, \dots, L+1\}$, for convenience
in simplifying $\mathcal{L}_{[\theta]}$. Using the Lagrangian of $\mathcal{L}_{[\theta]}$, which incorporates the constraint that the $K$ components of
$\theta_l$ sum to one, we obtain the estimate of $\theta$ over the whole corpus.
B.2.3 Optimization with respect to $\psi$

To maximize with respect to $\psi$, we isolate the corresponding terms and add Lagrange multipliers:

$$
\mathcal{L}_{[\psi]} = \sum_{d=1}^{D}\sum_{i=1}^{N}\sum_{k=1}^{K}\sum_{j=1}^{V}\gamma^d_{ik}\,(w^d)^j_i\log\psi_{kj}
 + \sum_{k=1}^{K}\lambda_k\Big(\sum_{j=1}^{V}\psi_{kj} - 1\Big).
$$

Taking the derivative with respect to $\psi_{kj}$ over the whole corpus and setting it to zero, we get:

$$
\psi_{kj} \propto \sum_{d=1}^{D}\sum_{i=1}^{N}\gamma^d_{ik}\,(w^d)^j_i.
$$


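B.2.4 Optimization with respect to $\mu$

For the Dirichlet parameters $\mu$, the involved terms are:

$$
\mathcal{L}_{[\mu]} = \sum_{d=1}^{D}\Big(\log\Gamma\Big(\sum_{j=1}^{K}\mu_j\Big) - \sum_{i=1}^{K}\log\Gamma(\mu_i)
 + \sum_{i=1}^{K}(\mu_i - 1)\Big(\Psi(\rho^d_i) - \Psi\Big(\sum_{j=1}^{K}\rho^d_j\Big)\Big)\Big).
$$

Taking the derivative with respect to $\mu_i$ gives:

$$
\frac{\partial\mathcal{L}_{[\mu]}}{\partial\mu_i} = D\Big(\Psi\Big(\sum_{j=1}^{K}\mu_j\Big) - \Psi(\mu_i)\Big)
 + \sum_{d=1}^{D}\Big(\Psi(\rho^d_i) - \Psi\Big(\sum_{j=1}^{K}\rho^d_j\Big)\Big).
$$

We can invoke the linear-time Newton-Raphson algorithm to estimate $\mu$, in the same way as in LDA.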
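The linear-time Newton-Raphson step used for the Dirichlet prior in LDA exploits the fact that the Hessian is a diagonal matrix plus a constant matrix, so the Newton direction costs $O(K)$ per iteration. The sketch below follows that standard LDA scheme applied to $\mu$ with the per-document variational parameters $\rho^d$ as input; it is an assumption that the authors' procedure matches it in detail. The names `prior_gradient` and `fit_dirichlet_prior`, the step halving used to keep $\mu$ positive, and the stdlib-only digamma and trigamma approximations are choices made for this example.

```python
import math

def digamma(x):
    """Approximate digamma for x > 0 (recurrence, then asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def trigamma(x):
    """Approximate trigamma for x > 0."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return result + inv + 0.5 * inv2 + inv * inv2 * (1 / 6 - inv2 * (1 / 30 - inv2 / 42))

def prior_gradient(mu, rho):
    """Gradient of the mu-terms of the bound; rho is a D x K list of lists."""
    D = len(rho)
    psi_sum = digamma(sum(mu))
    grad = [D * (psi_sum - digamma(m)) for m in mu]
    for r in rho:
        psi_r = digamma(sum(r))
        for i, ri in enumerate(r):
            grad[i] += digamma(ri) - psi_r
    return grad

def fit_dirichlet_prior(rho, mu, max_iter=100, tol=1e-10):
    """Newton-Raphson with Hessian = diag(h) + z * ones, inverted in O(K)."""
    D = len(rho)
    mu = list(mu)
    for _ in range(max_iter):
        g = prior_gradient(mu, rho)
        if max(abs(x) for x in g) < tol:
            break
        h = [-D * trigamma(m) for m in mu]
        z = D * trigamma(sum(mu))
        c = sum(gi / hi for gi, hi in zip(g, h)) / (1.0 / z + sum(1.0 / hi for hi in h))
        step = [(gi - c) / hi for gi, hi in zip(g, h)]
        # Assumption of this sketch: halve the step until mu stays positive.
        while any(m - s <= 0.0 for m, s in zip(mu, step)):
            step = [0.5 * s for s in step]
        mu = [m - s for m, s in zip(mu, step)]
    return mu
```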
Appendix C

Cluster Algorithm in Solution III 

As shown in Eq. 4, $\pi_l$, $l \in \{1, \dots, L\}$, is only associated with the documents that contain the $l$-th tag. Thus, before
running TWTM, we can group the documents into several clusters such that documents
which share a tag are always in the same cluster. This means that the tags divide the documents into
mutually independent subspaces. A simple example is shown in the left panel of Figure 12.
Fig. 12. Left: An example of the clustering result. Each row represents a document $d$ in a corpus $D$, and each
column represents a tag $t$. $D_{ij} = 1$ means that $t_j$ is given in $d_i$. The documents in the red circle belong to one
cluster, and the documents in the blue circle belong to another cluster. Right: An illustration of updating $\pi$ by
combining the different parts.

After document clustering, the tags contained in one cluster do not appear in any other cluster. In this
case, we can assign each cluster to a different compute node. When updating $\pi$, we simply combine
the $\pi^c$, where $c \in \{1, \dots, C\}$ and $C$ is the number of document clusters, as shown in the right
panel of Figure 12. The cluster process of Solution III is given in Algorithm 2.



Algorithm 2 The cluster process of Solution III

Input: a semi-structured corpus D = {(w^1, t^1), ..., (w^M, t^M)} and the tag set T of the corpus.
Output: a cluster set C that contains all the clusters; each cluster c in C contains a set of documents.
create a cluster set C = {}.
create pre_added_docs = {} to store the documents which are ready to add into the current cluster c.
create a tag set scanned_tags = {} to store the tags which have been scanned.
for each tag t in T do
    if t is not in scanned_tags then
        add t into scanned_tags;
        create a new cluster c, and add c into C;
    else
        continue;
    end if
    add the documents which own t into pre_added_docs;
    repeat
        for each d in pre_added_docs do
            if d is not in c then
                add d into c;
            end if
            for each tag t_d in d do
                add t_d into scanned_tags;
                add the documents which have t_d and are not in c into pre_added_docs;
            end for
            remove d from pre_added_docs;
        end for
    until pre_added_docs is empty.
end for
return C
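The cluster process amounts to finding connected components of the bipartite document-tag graph. The following is a minimal sketch of that idea, not the authors' MapReduce implementation: the function name `cluster_documents` and the representation of the corpus as a list of tag sets are assumptions of this example. Unlike Algorithm 2, it expands each tag only the first time it is scanned, which yields the same clusters; documents without any tag are not assigned to a cluster.

```python
from collections import defaultdict

def cluster_documents(doc_tags):
    """doc_tags: list of tag sets, one per document. Returns lists of document indices."""
    docs_by_tag = defaultdict(list)
    for d, tags in enumerate(doc_tags):
        for t in tags:
            docs_by_tag[t].append(d)

    scanned_tags = set()
    assigned = set()
    clusters = []
    for t in docs_by_tag:
        if t in scanned_tags:
            continue
        scanned_tags.add(t)
        cluster = []
        pending = list(docs_by_tag[t])  # documents waiting to join this cluster
        while pending:
            d = pending.pop()
            if d in assigned:
                continue
            assigned.add(d)
            cluster.append(d)
            for td in doc_tags[d]:
                if td not in scanned_tags:
                    scanned_tags.add(td)
                    pending.extend(docs_by_tag[td])
        clusters.append(sorted(cluster))
    return clusters
```

Each returned cluster could then be sent to its own compute node, since by construction no tag is shared between two clusters.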










































